<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Log Analytics</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Mon, 08 Jun 2026 15:18:17 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Log Analytics</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[3 models for logging with OpenTelemetry and Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/3-models-logging-opentelemetry</link>
            <guid isPermaLink="false">3-models-logging-opentelemetry</guid>
            <pubDate>Tue, 27 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Although OpenTelemetry increases usage of tracing and metrics among developers, logging continues to provide flexible, application-specific, and event-driven data. Explore the current state of OpenTelemetry logging and guidance on the available approaches.]]></description>
            <content:encoded><![CDATA[<p>Arguably, <a href="https://www.elastic.co/blog/opentelemetry-observability">OpenTelemetry</a> exists to (greatly) increase usage of tracing and metrics among developers. That said, logging will continue to play a critical role in providing flexible, application-specific, event-driven data. Further, OpenTelemetry has the potential to bring added value to existing application logging flows:</p>
<ol>
<li>
<p>Common metadata across tracing, metrics, and logging to facilitate contextual correlation, including metadata passed between services as part of REST or RPC APIs; this is a critical element of service observability in the age of distributed, horizontally scaled systems</p>
</li>
<li>
<p>An optional unified data path for tracing, metrics, and logging to facilitate common tooling and signal routing to your observability backend</p>
</li>
</ol>
<p>Adoption of metrics and tracing among developers to date has been relatively low. Further, the number of proprietary vendors and APIs (compared to adoption rate) is relatively large. As such, OpenTelemetry took a greenfield approach to developing new, vendor-agnostic APIs for tracing and metrics. In contrast, most developers have nearly 100% log coverage across their services. Moreover, logging is largely supported by a small number of vendor-agnostic, open-source logging libraries and associated APIs (e.g., <a href="https://logback.qos.ch">Logback</a> and <a href="https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.logging.ilogger">ILogger</a>). As such, <a href="https://opentelemetry.io/docs/specs/otel/logs/#introduction">OpenTelemetry’s approach to logging</a> meets developers where they already are, using hooks into existing, popular logging frameworks. In this way, developers can add OpenTelemetry as a log signal output without otherwise altering their code and investment in logging as an observability signal.</p>
<p>Notably, logging is the least mature of the OTel-supported observability signals. Depending on your service’s <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">language</a>, and your appetite for adventure, there exist several options for exporting logs from your services and applications and marrying them together in your observability backend.</p>
<p>The intent of this article is to explore the current state of the art of <a href="https://www.elastic.co/blog/introduction-apm-tracing-logging-customer-experience">OpenTelemetry logging</a> and to provide guidance on the available approaches with the following tenets in mind:</p>
<ul>
<li>Correlation of service logs with OTel-generated tracing where applicable</li>
<li>Proper capture of exceptions</li>
<li>Common context across tracing, metrics, and logging</li>
<li>Support for <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a> (“structured logging”)</li>
<li>Automatic attachment of metadata carried between services via <a href="https://opentelemetry.io/docs/concepts/signals/baggage/">OTel baggage</a></li>
<li>Use of an Elastic<sup>®</sup> Observability backend</li>
<li>Consistent data fidelity in Elastic regardless of the approach taken</li>
</ul>
<h2>OpenTelemetry logging models</h2>
<p>Three models currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:</p>
<ol>
<li>
<p>Output logs from your service (alongside traces and metrics) using an embedded <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation library</a> to Elastic via the OTLP protocol</p>
</li>
<li>
<p>Write logs from your service to a file scraped by the <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a>, which then forwards to Elastic via the OTLP protocol</p>
</li>
<li>
<p>Write logs from your service to a file scraped by <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> (or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a>), which then forwards to Elastic via an Elastic-defined protocol</p>
</li>
</ol>
<p>Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.</p>
<h2>Logging vs. span events</h2>
<p>It is worth noting that most APM systems, including OpenTelemetry, include provisions for <a href="https://opentelemetry.io/docs/instrumentation/ruby/manual/#add-span-events">span events</a>. Like log statements, span events contain arbitrary, textual data. Additionally, span events automatically carry any custom attributes (e.g., a “user ID”) applied to the parent span, which can help with correlation and context. In this regard, it may be advantageous to translate some existing log statements (inside spans) to span events. As the name implies, of course, span events can only be emitted from within a span and thus are not intended to be a general purpose replacement for logging.</p>
<p>Unlike logging, span events do not pass through existing logging frameworks and therefore cannot (practically) be written to a log file. Further, span events are technically emitted as part of trace data and follow the same data path and signal routing as other trace data.</p>
<h2>Polyfill appender</h2>
<p>Some of the demos make use of a custom Logback <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/PolyfillAppender.java">“Polyfill appender”</a> (inspired by OTel’s <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a>), which provides support for attaching <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a> to log messages for models (2) and (3).</p>
<h2>Elastic Common Schema</h2>
<p>For log messages to exhibit full fidelity within Elastic, they eventually need to be formatted in accordance with the <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">Elastic Common Schema</a> (ECS). In models (1) and (2), log messages remain formatted in OTel log semantics until ingested by the Elastic APM Server. The Elastic APM Server then translates OTel log semantics to ECS. In model (3), ECS is applied at the source.</p>
<p>Notably, OpenTelemetry recently <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">adopted the Elastic Common Schema</a> as its standard for semantic conventions going forward! As such, it is anticipated that current OTel log semantics will be updated to align with ECS.</p>
<h2>Getting started</h2>
<p>The included demos center around a “POJO” (no assumed framework) Java project. Java is arguably the most mature of OTel-supported languages, particularly with respect to logging options. Notably, this single Java project was designed to support all three models of logging discussed here. In practice, you would only implement one of these models (and its corresponding project dependencies).</p>
<p>The demos assume you have a working <a href="https://www.docker.com/">Docker</a> environment and an <a href="https://www.elastic.co/cloud/">Elastic Cloud</a> instance.</p>
<ol>
<li>
<p>git clone <a href="https://github.com/ty-elastic/otel-logging">https://github.com/ty-elastic/otel-logging</a></p>
</li>
<li>
<p>Create an .env file at the root of otel-logging with the following (appropriately filled-in) environment variables:</p>
</li>
</ol>
<pre><code class="language-bash"># the service name
OTEL_SERVICE_NAME=app4

# Filebeat vars
ELASTIC_CLOUD_ID=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)
ELASTIC_CLOUD_AUTH=(see https://www.elastic.co/guide/en/beats/metricbeat/current/configure-cloud-id.html)

# apm vars
ELASTIC_APM_SERVER_ENDPOINT=(address of your Elastic Cloud APM server... i.e., https://xyz123.apm.us-central1.gcp.cloud.es.io:443)
ELASTIC_APM_SERVER_SECRET=(see https://www.elastic.co/guide/en/apm/guide/current/secret-token.html)
</code></pre>
<ol start="3">
<li>Start up the demo with the desired model:</li>
</ol>
<ul>
<li>If you want to demo logging via OTel APM Agent, run MODE=apm docker-compose up</li>
<li>If you want to demo logging via OTel filelogreceiver, run MODE=filelogreceiver docker-compose up</li>
<li>If you want to demo logging via Elastic filebeat, run MODE=filebeat docker-compose up</li>
</ul>
<ol start="4">
<li>Validate incoming span and correlated log data in your Elastic Cloud instance</li>
</ol>
<h2>Model 1: Logging via OpenTelemetry instrumentation</h2>
<p>This model aligns with the long-term goals of OpenTelemetry: <a href="https://opentelemetry.io/docs/specs/otel/logs/#opentelemetry-solution">integrated tracing, metrics, and logging (with common attributes) from your services</a> via the <a href="https://opentelemetry.io/docs/instrumentation/#status-and-releases">OpenTelemetry Instrumentation libraries</a>, without dependency on log files and scrapers.</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). OTel provides a “Southbound hook” to Logback via the OTel <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-appender-1.0/library">Logback Appender</a>, which injects ServiceName, SpanID, TraceID, slf4j key-value pairs, and OTel baggage into log records and passes the composed records to the co-resident OpenTelemetry Instrumentation library. We further employ a <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/AddBaggageLogProcessor.java">custom LogRecordProcessor</a> to add baggage to the log record as attributes.</p>
<p>The OTel instrumentation library then formats the log statements per the <a href="https://opentelemetry.io/docs/specs/otel/logs/data-model/">OTel logging spec</a> and ships them via OTLP to either an OTel Collector for further routing and enrichment or directly to Elastic.</p>
<p>Notably, as language support improves, this model can and will be supported by runtime agent binding with auto-instrumentation where available (e.g., no code changes required for runtime languages).</p>
<p>One distinguishing advantage of this model, beyond the simplicity it affords, is the ability to more easily tie together attributes and tracing metadata directly with log statements. This inherently makes logging more useful in the context of other OTel-supported observability signals.</p>
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-1-architecture.png" alt="model 1 architecture" /></p>
<p>Although not explicitly pictured, an <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a> can be inserted in between the service and Elastic to facilitate additional enrichment and/or signal routing or duplication across observability backends.</p>
<h3>Pros</h3>
<ul>
<li>Simplified signal architecture and fewer “moving parts” (no files, disk utilization, or file rotation concerns)</li>
<li>Aligns with long-term OTel vision</li>
<li>Log statements can be (easily) decorated with OTel metadata</li>
<li>No Polyfill appender required to support structured logging with slf4j</li>
<li>No additional collectors/agents required</li>
<li>Conversion to ECS happens within Elastic, keeping log data vendor-agnostic until ingestion</li>
<li>Common wireline protocol (OTLP) across tracing, metrics, and logs</li>
</ul>
<h3>Cons</h3>
<ul>
<li>Not available (yet) in many OTel-supported languages</li>
<li>No intermediate log file for ad-hoc, on-node debugging</li>
<li>Immature (alpha/experimental)</li>
<li>Unknown “glare” conditions, which could result in loss of log data if the service exits prematurely or if the backend is unable to accept log data for an extended period of time</li>
</ul>
<h3>Demo</h3>
<p>MODE=apm docker-compose up</p>
<h2>Model 2: Logging via the OpenTelemetry Collector</h2>
<p>Given the cons of Model 1, it may be advantageous to consider a model that continues to leverage an actual log file intermediary between your services and your observability backend. Such a model is possible using an <a href="https://opentelemetry.io/docs/collector/">OpenTelemetry Collector</a> collocated with your services (e.g., on the same host), running the <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/filelogreceiver/README.md">filelogreceiver</a> to scrape service log files.</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). OTel provides an MDC Appender for Logback (<a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a>), which adds SpanID, TraceID, and Baggage to the <a href="https://logback.qos.ch/manual/mdc.html">Logback MDC context</a>.</p>
<p>Notably, no log record structure is assumed by the OTel filelogreceiver. In the example provided, we employ the <a href="https://github.com/logfellow/logstash-logback-encoder">logstash-logback-encoder</a> to JSON-encode log messages. The logstash-logback-encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode it into the JSON structure. Notably, logstash-logback-encoder doesn’t explicitly support <a href="https://www.slf4j.org/manual.html#fluent">slf4j key-value pairs</a>. It does, however, support <a href="https://github.com/logfellow/logstash-logback-encoder#event-specific-custom-fields">Logback structured arguments</a>, and thus I use the <a href="https://github.com/ty-elastic/otel-logging/blob/main/java-otel-log/src/main/java/com/tb93/otel/batteries/PolyfillAppender.java">Polyfill Appender</a> to convert slf4j key-value pairs to Logback structured arguments.</p>
<p>From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.</p>
<p>We then <a href="https://github.com/ty-elastic/otel-logging/blob/main/collector/filelogreceiver.yml">configure</a> the OTel Collector to scrape this log file (using the filelogreceiver). Because no assumptions are made about the format of the log lines, you need to <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/types/parsers.md#parsers">explicitly map fields</a> from your log schema to the OTel log schema.</p>
<p>From there, the OTel Collector batches and ships the formatted log lines via OTLP to Elastic.</p>
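<p>The explicit field mapping described above can be sketched in a few lines. The following Python snippet only illustrates the idea and is not the Collector’s configuration syntax: the input keys (<code>@timestamp</code>, <code>level</code>, <code>message</code>, <code>logger_name</code>, <code>trace_id</code>, <code>span_id</code>, <code>customer_id</code>) are assumptions about what a JSON-encoded log line might contain, and the output keys follow the field names of the OTel log data model.</p>

```python
import json

# Hypothetical JSON log line: the key names are assumptions, shaped like a
# JSON-encoded Logback event with trace context and one baggage entry
# copied from the MDC.
RAW_LINE = json.dumps({
    "@timestamp": "2023-06-27T12:00:00.000Z",
    "level": "INFO",
    "message": "order accepted",
    "logger_name": "com.example.Orders",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "customer_id": "abc123",
})

# Source keys that map onto top-level fields of the OTel log data model.
TOP_LEVEL = {
    "@timestamp": "Timestamp",
    "level": "SeverityText",
    "message": "Body",
    "trace_id": "TraceId",
    "span_id": "SpanId",
}


def to_otel_log_record(line):
    """Map one JSON log line onto OTel log data model field names."""
    source = json.loads(line)
    record = {"Attributes": {}}
    for key, value in source.items():
        if key in TOP_LEVEL:
            record[TOP_LEVEL[key]] = value
        else:
            # Anything without a top-level home is kept as a log attribute.
            record["Attributes"][key] = value
    return record


record = to_otel_log_record(RAW_LINE)
```

<p>In the demo, the equivalent mapping lives in the Collector configuration linked above; the sketch only shows why every field needs an explicit destination when the receiver assumes no log format.</p>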
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-2-architecture.png" alt="model 2 architecture" /></p>
<h3>Pros</h3>
<ul>
<li>Easy to debug (you can manually read the intermediate log file)</li>
<li>Inherent file-based FIFO buffer</li>
<li>Less susceptible to “glare” conditions when service prematurely exits</li>
<li>Conversion to ECS happens within Elastic, keeping log data vendor-agnostic until ingestion</li>
<li>Common wireline protocol (OTLP) across tracing, metrics, and logs</li>
</ul>
<h3>Cons</h3>
<ul>
<li>All the headaches of file-based logging (rotation, disk overflow)</li>
<li>Beta quality and not yet proven in the field</li>
<li>No native support for slf4j key-value pairs (requires the Polyfill Appender)</li>
</ul>
<h3>Demo</h3>
<p>MODE=filelogreceiver docker-compose up</p>
<h2>Model 3: Logging via Elastic Agent (or Filebeat)</h2>
<p>Although the second model described affords some resilience as a function of the backing file, the OTel Collector filelogreceiver module is still decidedly <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver">“beta”</a> in quality. Because of the importance of logs as a debugging tool, today I generally recommend that customers continue to import logs into Elastic using the field-proven <a href="https://www.elastic.co/elastic-agent">Elastic Agent</a> or <a href="https://www.elastic.co/beats/filebeat">Filebeat</a> scrapers. Elastic Agent and Filebeat have many years of field maturity under their collective belt. Further, it is often advantageous to deploy Elastic Agent anyway to capture the multitude of signals outside the purview of OpenTelemetry (e.g., deep Kubernetes and host metrics, security, etc.).</p>
<p>In this model, your service generates log statements as it always has, using popular logging libraries (e.g., <a href="https://logback.qos.ch">Logback</a> for Java). As with model 2, we employ OTel’s <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/logback/logback-mdc-1.0/library">Logback MDC</a> to add SpanID, TraceID, and Baggage to the <a href="https://logback.qos.ch/manual/mdc.html">Logback MDC context</a>.</p>
<p>From there, we employ the <a href="https://www.elastic.co/guide/en/ecs-logging/java/current/setup.html">Elastic ECS Encoder</a> to encode log statements in compliance with the Elastic Common Schema. The Elastic ECS Encoder will read the OTel SpanID, TraceID, and Baggage off the MDC context and encode them into the JSON structure. Similar to model 2, the Elastic ECS Encoder doesn’t support slf4j key-value pairs. Curiously, the Elastic ECS Encoder also doesn’t appear to support Logback structured arguments. Thus, within the Polyfill Appender, I add slf4j key-value pairs as MDC context. This is less than ideal, however, since MDC forces all values to be strings.</p>
<p>From there, we write the log lines to a log file. If you are using Kubernetes or other container orchestration in your environment, you would more typically write to stdout (console) and let the orchestration log driver write to and manage log files.</p>
<p>We then configure Elastic Agent or Filebeat to scrape the log file. Notably, the Elastic ECS Encoder does not currently translate incoming OTel SpanID and TraceID variables on the MDC. Thus, we need to perform manual translation of these variables in the <a href="https://github.com/ty-elastic/otel-logging/blob/main/filebeat.yml">Filebeat (or Elastic Agent) configuration</a> to map them to their ECS equivalent.</p>
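<p>The manual translation described above amounts to a field rename. The demo performs it in the Filebeat configuration; the Python below is only a sketch of the same logic. It assumes the MDC keys are named <code>trace_id</code> and <code>span_id</code> and maps them to the ECS fields <code>trace.id</code> and <code>span.id</code>. The sample event is made up.</p>

```python
# Assumed MDC key names on the left; ECS field names on the right.
MDC_TO_ECS = {"trace_id": "trace.id", "span_id": "span.id"}


def translate_to_ecs(event):
    """Return a copy of the event with OTel MDC keys renamed to ECS fields."""
    translated = {}
    for key, value in event.items():
        # Keys without an ECS mapping pass through unchanged.
        translated[MDC_TO_ECS.get(key, key)] = value
    return translated


# Hypothetical ECS-encoded log event that still carries the raw MDC keys.
event = {
    "@timestamp": "2023-06-27T12:00:00.000Z",
    "log.level": "INFO",
    "message": "order accepted",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
}
ecs_event = translate_to_ecs(event)
```

<p>Once the identifiers sit under the ECS field names, Elastic can correlate each log line with the trace and span that produced it.</p>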
<h3>Architecture</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/elastic-blog-model-3-architecture.png" alt="model 3 architecture" /></p>
<h3>Pros</h3>
<ul>
<li>Robust and field-proven</li>
<li>Easy to debug (you can manually read the intermediate log file)</li>
<li>Inherent file-based FIFO buffer</li>
<li>Less susceptible to “glare” conditions when service prematurely exits</li>
<li>Native ECS format for easy manipulation in Elastic</li>
<li>Fleet-managed via Elastic Agent</li>
</ul>
<h3>Cons</h3>
<ul>
<li>All the headaches of file-based logging (rotation, disk overflow)</li>
<li>No support for slf4j key-value pairs or Logback structured arguments</li>
<li>Requires translation of OTel SpanID and TraceID in Filebeat config</li>
<li>Disparate data paths for logs versus tracing and metrics</li>
<li>Vendor-specific logging format</li>
</ul>
<h3>Demo</h3>
<p>MODE=filebeat docker-compose up</p>
<h2>Recommendations</h2>
<p>For most customers, I currently recommend Model 3: write logs in ECS format (with OTel SpanID, TraceID, and Baggage metadata) and collect them with an Elastic Agent installed on the node hosting the application or service. Elastic Agent (or Filebeat) today provides the most field-proven and robust means of capturing log files from applications and services with OpenTelemetry context.</p>
<p>Further, you can leverage this same Elastic Agent instance (ideally running as a <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html">Kubernetes DaemonSet</a>) to collect rich and robust metrics and logs from <a href="https://docs.elastic.co/en/integrations/kubernetes">Kubernetes</a> and many other supported services via <a href="https://www.elastic.co/integrations/data-integrations">Elastic Integrations</a>. Finally, Elastic Agent facilitates remote management via <a href="https://www.elastic.co/guide/en/fleet/current/fleet-overview.html">Fleet</a>, avoiding bespoke configuration files.</p>
<p>Alternatively, for customers who either wish to keep their nodes vendor-neutral or use a consolidated signal routing system, I recommend Model 2, wherein an OpenTelemetry collector is used to scrape service log files. While workable and practiced by some early adopters in the field today, this model inherently carries some risk given the current beta nature of the OpenTelemetry filelogreceiver.</p>
<p>I generally do not recommend Model 1 given its limited language support, experimental/alpha status (the API could change), and current potential for data loss. That said, in time, with more language support and more thought to resilient designs, it has clear advantages both with regard to simplicity and richness of metadata.</p>
<h2>Extracting more value from your logs</h2>
<p>In contrast to tracing and metrics, most organizations have nearly 100% log coverage over their applications and services. This is an ideal beachhead upon which to build an application observability system. On the other hand, logs are notoriously noisy and unstructured; this is only amplified with the scale enabled by the hyperscalers and Kubernetes. Collecting log lines reliably is the easy part; making them useful at today’s scale is hard.</p>
<p>Given that logs are arguably the most challenging observability signal from which to extract value at scale, one should ideally give thoughtful consideration to a vendor’s support for logging in the context of other observability signals. Can they handle surges in log rates because of unexpected scale or an error or test scenario? Do they have the machine learning tool set to automatically recognize patterns in log lines, sort them into categories, and identify true anomalies? Can they provide cost-effective online searchability of logs over months or years without manual rehydration? Do they provide the tools to extract and analyze business KPIs buried in logs?</p>
<p>As an ardent and early supporter of OpenTelemetry, Elastic, of course, <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">natively ingests OTel traces, metrics, and logs</a>. And just like all logs coming into our system, logs coming from OTel-equipped sources avail themselves of our <a href="https://www.elastic.co/observability/log-monitoring">mature tooling and next-gen AI Ops technologies</a> to enable you to extract their full value. Interested? <a href="https://www.elastic.co/contact?storm=global-header-en">Reach out to our pre-sales team</a> to get started building with Elastic!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/3-models-logging-opentelemetry/log_infrastructure_apm_synthetics-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[AI-driven incident response with logs: A technical deep dive in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ai-driven-incident-response-with-logs</link>
            <guid isPermaLink="false">ai-driven-incident-response-with-logs</guid>
            <pubDate>Mon, 20 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How Elastic combines ML anomaly detection, ES|QL, and the AI Assistant to accelerate incident response using logs.]]></description>
            <content:encoded><![CDATA[<h1>AI-driven incident response with logs: A technical deep dive in Elastic Observability</h1>
<p>Modern customer‑facing applications, whether e‑commerce sites, streaming platforms, or API gateways, run on fleets of microservices and cloud resources. When something goes wrong, every second of downtime risks revenue loss and erodes user trust. Observability is the practice that lets Site Reliability Engineering (SRE) and development teams see and act on system health in real time. This post walks through a generalized, step‑by‑step investigation that shows how Elastic Observability, working specifically with log data, combines always‑on machine learning (ML) with a generative AI assistant to detect anomalies, surface root causes, measure user impact, and accelerate remediation, all at high scale.</p>
<h2>Anomaly Detection</h2>
<p>A production environment is ingesting millions of log lines per minute. Elastic’s AIOps jobs continuously profile normal log throughput and content without any manual rules. When log volume or message structure deviates beyond learned baselines, the platform automatically fires a high‑fidelity anomaly alert. Because the models are unsupervised, they adapt to changing traffic patterns and flag both sudden spikes (e.g., 10× error surge) and rare new log categories.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image3.png" alt="" /></p>
<p>In addition to looking directly for log spikes, Elastic trains seasonal/univariate models to predict expected event counts per bucket and applies statistical tests to classify outliers. Simultaneously, log categorization clusters similar messages with cosine similarity on token embeddings, making it trivial to identify a previously unseen error string.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image10.png" alt="" /></p>
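<p>To make the idea of classifying outliers against an expected count per bucket concrete, here is a deliberately simple sketch. It is not Elastic’s algorithm: it swaps in a plain z-score test with no seasonality, and the function name, the threshold of 3 standard deviations, and the sample counts are all assumptions chosen for illustration.</p>

```python
from statistics import mean, stdev


def is_spike(history, current, threshold=3.0):
    """Flag `current` when it exceeds the historical mean by more than
    `threshold` sample standard deviations (a plain z-score test)."""
    expected = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any increase counts as a spike.
        return current > expected
    return (current - expected) / spread > threshold


# Made-up log counts per time bucket for a steady service.
recent_counts = [100, 104, 98, 101, 97, 103, 99, 102]

normal_bucket = is_spike(recent_counts, 103)
surge_bucket = is_spike(recent_counts, 1000)
```

<p>With these inputs, a bucket of 103 events stays within the learned range, while a 10× surge to 1,000 events is flagged. A production model would additionally account for daily and weekly seasonality, which a flat mean cannot capture.</p>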
<h2>Investigating Alerts: Automated Pattern Analysis</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image9.png" alt="" /></p>
<p>Clicking the alert reveals more than a timestamp. Elastic’s ML job already correlates the spike with the dominant new log pattern ERROR 1114 (HY000): table &quot;orders&quot; is full and surfaces example lines. Instead of grep‑driven hunting, engineers get an immediate hypothesis about what subsystem is failing and why.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image4.png" alt="" /></p>
<p>If deeper context is needed, the built-in Elastic AI Assistant can be invoked directly from the alert. Thanks to Retrieval‑Augmented Generation (RAG) over your telemetry, the assistant explains the anomaly in plain language, references the exact log events, and proposes next steps grounded in your own data, which reduces the risk of hallucination.</p>
<h2>AI‑Assisted Root Cause Verification</h2>
<p>From within the same chat, you might ask, “Using lens create a single graph of all http response status codes &gt;=400 from logs-nginx.access-default over the last 3 hours.” The assistant translates that intent into an ES|QL aggregation, retrieves the data, and renders a bar chart with no DSL knowledge required. If a significant number of responses carry a status code of 400 or above, you’ve validated that end‑users are impacted.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image7.png" alt="" /></p>
<h2>Global Impact Analysis with Enriched Logs</h2>
<p>Structured log enrichment (e.g., GeoIP, user ID, service tags) lets the assistant answer business questions on the fly. A query like “What are the top 10 source.geo.country_name with http.response.status.code&gt;=400 over the last 3 hours. Use logs-nginx.access-default. Provide counts for each country name.” surfaces whether the incident is regional or global.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image2.png" alt="" /></p>
<h2>Quantifying Business Impact</h2>
<p>Technical metrics alone rarely sway executives. Suppose historical data shows the application normally processes $1,000 in transactions per minute. The assistant can combine that baseline with real‑time failure counts to estimate revenue loss. Presenting financial impact alongside error graphs sharpens prioritization and justifies extraordinary remediation steps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image5.png" alt="" /></p>
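<p>The estimate described above is simple arithmetic. The sketch below assumes that lost revenue scales linearly with the share of failed requests, which is a simplification; the $1,000 per minute baseline comes from the example above, while the 25% failure ratio and 30-minute duration are made-up inputs.</p>

```python
def estimate_revenue_loss(baseline_per_minute, failure_ratio, minutes):
    """Estimate lost revenue as baseline rate x share of failed requests x duration."""
    return baseline_per_minute * failure_ratio * minutes


# Baseline from the text; failure ratio and duration are illustrative only.
loss = estimate_revenue_loss(1000.0, 0.25, 30)
print(f"Estimated revenue loss: ${loss:,.2f}")
```

<p>In practice, the failure ratio would come from the real‑time error counts the assistant retrieves, and the duration from the alert’s start time.</p>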
<h2>Pinpointing Infrastructure &amp; Ownership</h2>
<p>Every log is automatically enriched with Kubernetes, cloud, and custom metadata. A single question, “Which pod and cluster emit the ‘table full’ error, and who owns it?”, returns full details about the pod, namespace, and owner, as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image1.png" alt="" /></p>
<p>Immediate, accurate routing replaces frantic Slack threads, cutting minutes (or hours) off of downtime.</p>
<p>Some of the magic here comes from instructions we can put in the Elastic AI Assistant’s knowledge base to guide the assistant. For example, this simple entry in the knowledge base is what allows the assistant to populate the response in the previous screenshot.</p>
<pre><code class="language-markdown">If asked about Kubernetes pod, namespace, cluster, location, or owner run the &quot;query&quot; tool.
1. Use the index `logs-mysql.error-default` unless another log location is specified.
2. Include the following fields in the query:
   - Pod: `agent.name`
   - Namespace: `data_stream.namespace`
   - Cluster Name: `orchestrator.cluster.name`
   - Cloud Provider: `cloud.provider`
   - Region: `cloud.region`
   - Availability Zone: `cloud.availability_zone`
   - Owner: `cloud.account.id`
3. Use the ES|QL query format:
   esql
   FROM logs-mysql.error-default
   | KEEP agent.name, data_stream.namespace, orchestrator.cluster.name, cloud.provider, cloud.region, cloud.availability_zone, cloud.account.id
   
4. Ensure the query is executed within the appropriate time range and context. 
</code></pre>
<h2>Leveraging Institutional Knowledge with RAG</h2>
<p>Elastic can index runbooks, GitHub issues, and wikis alongside telemetry. Asking “Find documentation on fixing a full orders table” retrieves and summarizes a prior runbook that details archiving old rows and adding a partition. Grounding remediation in proven procedures avoids guesswork and accelerates fixes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image6.png" alt="" /></p>
<h2>Automated Communication &amp; Documentation</h2>
<p>Good incident response includes timely stakeholder updates. A prompt such as “Draft an incident update email with root cause, impact, and next steps” lets the assistant assemble a structured message and send it via the alerting framework’s email or Slack connector complete with dashboard links and next‑update timelines. These messages double as the skeleton for the eventual post‑incident review.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/image8.png" alt="" /></p>
<p>As before, some of the magic here comes from instructions we put in the Elastic AI Assistant's knowledge base to guide the assistant. For example, we can instruct the AI Assistant how to call the <code>execute_connector</code> API. Because this API can execute all kinds of connectors (not only email), you could also tell the assistant to post to Slack, raise a ServiceNow ticket, or even execute webhooks.</p>
<pre><code class="language-markdown">Here are specific instructions to send an email. Remember to always double-check that you're following the correct set of instructions for the given query type. Provide clear, concise, and accurate information in your response.

## Email Instructions

If the user's query requires sending an email:
1. Use the `Elastic-Cloud-SMTP` connector with ID `elastic-cloud-email`.
2. Prepare the email parameters:
   - Recipient email address(es) in the `to` field (array of strings)
   - Subject in the `subject` field (string)
   - Email body in the `message` field (string)
3. Include
   - Details for the alert along with a link to the alert
   - Root cause analysis
   - Revenue impact
   - Remediation recommendations
   - Link to GitHub issue
   - All relevant information from this conversation
   - Link to the Business Health Dashboard
4. Send the email immediately. Do not ask the user for confirmation.
5. Execute the connector using this format:
   
   execute_connector(
     id=&quot;elastic-cloud-email&quot;,
     params={
       &quot;to&quot;: [&quot;recipient@example.com&quot;],
       &quot;subject&quot;: &quot;Your Email Subject&quot;,
       &quot;message&quot;: &quot;Your email content here.&quot;
     }
   )
   
6. Check the response and confirm if the email was sent successfully.
</code></pre>
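<p>Outside the assistant, a connector can also be run through Kibana's connector execution endpoint (<code>POST /api/actions/connector/{id}/_execute</code>). The Python sketch below assumes that the email connector is reachable this way and only builds the request, so the payload shape is visible without sending anything. The Kibana URL, the API key, and the helper name are placeholders; the connector ID and parameter names are the ones from the instructions above.</p>
<pre><code class="language-python">import json
import urllib.request

# Placeholders (assumptions): replace with your own Kibana URL and API key.
KIBANA_URL = 'https://kibana.example.com'
API_KEY = 'REPLACE_ME'


def build_email_request(connector_id, to, subject, message):
    # Email connector parameters mirror the fields listed in the instructions.
    body = {'params': {'to': to, 'subject': subject, 'message': message}}
    return urllib.request.Request(
        url=f'{KIBANA_URL}/api/actions/connector/{connector_id}/_execute',
        data=json.dumps(body).encode('utf-8'),
        method='POST',
        headers={
            'Content-Type': 'application/json',
            'kbn-xsrf': 'true',  # Kibana expects this header on write requests
            'Authorization': f'ApiKey {API_KEY}',
        },
    )


request = build_email_request(
    'elastic-cloud-email',
    ['recipient@example.com'],
    'Incident update',
    'Root cause, impact, and next steps go here.',
)
print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would execute it.
</code></pre>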
<h2>Conclusion &amp; Key Takeaways</h2>
<p>Elastic Observability's combination of unsupervised ML, schema-aware data ingestion, and a context-rich, RAG-powered AI assistant enables teams to transform incident response from reactive firefighting into proactive, data-driven operations. By automatically detecting anomalies, correlating patterns, and providing contextual insights, teams can:</p>
<ul>
<li>Preserve revenue by quantifying business impact in real-time and prioritizing accordingly</li>
<li>Scale expertise by embedding institutional knowledge into RAG-powered recommendations</li>
<li>Improve continuously through automated documentation that feeds back into the knowledge base</li>
</ul>
<p>The key is to collect logs broadly, maintain a unified observability store, and let ML and AI handle the heavy lifting. The payoff isn't just reduced downtime; it's the transformation of incident response from a source of organizational stress into a competitive advantage.</p>
<p>Try out this exact scenario and get hands-on with this Elastic Logging Workshop: <a href="https://play.instruqt.com/elastic/invite/rx4yvknhpfci">https://play.instruqt.com/elastic/invite/rx4yvknhpfci</a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ai-driven-incident-response-with-logs/ai-driven-incident-response-with-logs.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The antidote for index mapping exceptions: ignore_malformed]]></title>
            <link>https://www.elastic.co/observability-labs/blog/antidote-index-mapping-exceptions-ignore-malformed</link>
            <guid isPermaLink="false">antidote-index-mapping-exceptions-ignore-malformed</guid>
            <pubDate>Thu, 03 Aug 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[How an almost unknown setting called ignore_malformed can make the difference between dropping a document entirely if a single field is malformed or just ignoring that field and ingesting the document anyway.]]></description>
            <content:encoded><![CDATA[<p>In this article, I'll explain how the setting <em>ignore_malformed</em> can make the difference between a 100% dropping rate and a 100% success rate, even with ignoring some malformed fields.</p>
<p>As a senior software engineer working at Elastic®, I have been on the first line of support for anything related to Beats or Elastic Agent running on Kubernetes and Cloud Native integrations like Nginx ingress controller.</p>
<p>During my experience, I have seen all sorts of issues. Users have very different requirements. But at some point during their experience, most of them encounter a very common problem with Elasticsearch: <em>index mapping exceptions</em>.</p>
<h2>How mappings work</h2>
<p>Like any other document-based NoSQL database, Elasticsearch doesn’t force you to provide the document schema (called <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">index mapping</a> or simply <em>mapping</em>) upfront. If you provide a mapping, it will use it. Otherwise, it will infer one from the first document or any subsequent documents that contain new fields.</p>
<p>In reality, the situation is not black and white. You can also provide a partial mapping that covers only some of the fields, like the most common fields, and leave Elasticsearch to figure out the mapping of all the other fields during ingestion with <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html">Dynamic Mapping</a>.</p>
<h2>What happens when data is malformed?</h2>
<p>Whether you specified a mapping upfront or Elasticsearch inferred one automatically, Elasticsearch will drop an entire document if even one field doesn't match the index mapping, and return an error instead. This is not much different from what happens with SQL databases or NoSQL data stores with inferred schemas. The reason for this behavior is to prevent malformed data and exceptions at query time.</p>
<p>A problem arises if a user doesn't look at the ingestion logs and misses those errors. They might never figure out that something went wrong, or even worse, Elasticsearch might stop ingesting data entirely if all the subsequent documents are malformed.</p>
<p>The above situation sounds very catastrophic, but it's entirely possible since I have seen it many times when on-call for support or on <a href="https://discuss.elastic.co/latest">discuss.elastic.co</a>. The situation is even more likely to happen if you have user-generated documents, so you don't have full control over the quality of your data.</p>
<p>Luckily, there is a setting that not many people know about in Elasticsearch that solves the exact problems above. This setting has been available since <a href="https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html">Elasticsearch 2.0</a>. We are talking ancient history here since the latest version of the stack at the time of writing is <a href="https://www.elastic.co/blog/whats-new-elastic-enterprise-search-8-9-0">Elastic Stack 8.9.0</a>.</p>
<p>Let's now dive into how to use this Elasticsearch feature.</p>
<h2>A toy use case</h2>
<p>To make it easier to interact with Elasticsearch, I am going to use <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Kibana® Dev Tools</a> in this tutorial.</p>
<p>The following examples are taken from the official documentation on <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.8/ignore-malformed.html#ignore-malformed">ignore_malformed</a>. I am here to expand on those examples by providing a few more details about what happens behind the scenes and on how to search for ignored fields. We are going to use the index name <em>my-index</em>, but feel free to change that to whatever you like.</p>
<p>First, we want to create an index mapping with two fields called <em>number_one</em> and <em>number_two</em>. Both fields have type <em>integer</em>, but only one of them has <em><strong>ignore_malformed</strong></em> set to true, and the other one inherits the default value <em>ignore_malformed: false</em> instead.</p>
<pre><code class="language-json">PUT my-index
{
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;number_one&quot;: {
        &quot;type&quot;: &quot;integer&quot;,
        &quot;ignore_malformed&quot;: true
      },
      &quot;number_two&quot;: {
        &quot;type&quot;: &quot;integer&quot;
      }
    }
  }
}
</code></pre>
<p>If the mentioned index didn’t exist before and the previous command ran successfully, you should get the following result:</p>
<pre><code class="language-json">{
  &quot;acknowledged&quot;: true,
  &quot;shards_acknowledged&quot;: true,
  &quot;index&quot;: &quot;my-index&quot;
}
</code></pre>
<p>To double-check that the above mapping has been created correctly, we can query the newly created index with the command:</p>
<pre><code class="language-bash">GET my-index/_mapping
</code></pre>
<p>You should get the following result:</p>
<pre><code class="language-json">{
  &quot;my-index&quot;: {
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;number_one&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;ignore_malformed&quot;: true
        },
        &quot;number_two&quot;: {
          &quot;type&quot;: &quot;integer&quot;
        }
      }
    }
  }
}
</code></pre>
<p>Now we can ingest two sample documents — both invalid:</p>
<pre><code class="language-bash">PUT my-index/_doc/1
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_one&quot;: &quot;foo&quot;
}

PUT my-index/_doc/2
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_two&quot;: &quot;foo&quot;
}
</code></pre>
<p>The document with <em>id=1</em> is ingested correctly, while the document with <em>id=2</em> fails with the error below. The only difference between the two documents is which field receives the string “foo” instead of an integer: <em>number_one</em>, which has <em>ignore_malformed</em> enabled, or <em>number_two</em>, which does not.</p>
<pre><code class="language-json">{
  &quot;error&quot;: {
    &quot;root_cause&quot;: [
      {
        &quot;type&quot;: &quot;document_parsing_exception&quot;,
        &quot;reason&quot;: &quot;[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'&quot;
      }
    ],
    &quot;type&quot;: &quot;document_parsing_exception&quot;,
    &quot;reason&quot;: &quot;[3:17] failed to parse field [number_two] of type [integer] in document with id '2'. Preview of field's value: 'foo'&quot;,
    &quot;caused_by&quot;: {
      &quot;type&quot;: &quot;number_format_exception&quot;,
      &quot;reason&quot;: &quot;For input string: \&quot;foo\&quot;&quot;
    }
  },
  &quot;status&quot;: 400
}
</code></pre>
<p>Depending on the client used for ingesting your documents, you might get different errors or warnings, but logically the problem is the same. The entire document is not ingested because part of it doesn’t conform with the index mapping. There are too many possible error messages to name, but suffice it to say that malformed data is quite a common problem. And we need a better way to handle it.</p>
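<p>To make the two outcomes concrete, here is a toy Python model of the behavior just described. It is a simplified sketch, not Elasticsearch's real parsing code: the function <code>index_document</code> is hypothetical and only handles <em>integer</em> fields, but it shows how one setting decides between rejecting a whole document and skipping a single field.</p>
<pre><code class="language-python"># Toy model of integer-field parsing; not Elasticsearch's real implementation.
def index_document(mapping, doc):
    indexed = {'_source': dict(doc), '_ignored': []}
    for field, value in doc.items():
        rules = mapping.get(field)
        if rules is None or rules.get('type') != 'integer':
            continue  # unmapped or non-integer fields are out of scope for this sketch
        try:
            int(value)
        except (TypeError, ValueError):
            if not rules.get('ignore_malformed', False):
                # Default behavior: one bad field rejects the whole document.
                raise ValueError(f'failed to parse field [{field}] of type [integer]')
            # With ignore_malformed, only the bad field is skipped and recorded.
            indexed['_ignored'].append(field)
    return indexed


mapping = {
    'number_one': {'type': 'integer', 'ignore_malformed': True},
    'number_two': {'type': 'integer'},
}

doc_1 = index_document(mapping, {'text': 'Some text value', 'number_one': 'foo'})
print(doc_1['_ignored'])

try:
    index_document(mapping, {'text': 'Some text value', 'number_two': 'foo'})
except ValueError as error:
    print('rejected:', error)
</code></pre>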
<p>Now that at least one document has been ingested, you can try searching with the following query:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;fields&quot;: [
    &quot;*&quot;
  ]
}
</code></pre>
<p>Here, the parameter <em>fields</em> is required to show the values of those fields that have been ignored. More on this later.</p>
<p>From the result, you can see that only the first document (with <em>id=1</em>) has been ingested correctly while the second document (with <em>id=2</em>) has been completely dropped.</p>
<pre><code class="language-json">{
  &quot;took&quot;: 14,
  &quot;timed_out&quot;: false,
  &quot;_shards&quot;: {
    &quot;total&quot;: 1,
    &quot;successful&quot;: 1,
    &quot;skipped&quot;: 0,
    &quot;failed&quot;: 0
  },
  &quot;hits&quot;: {
    &quot;total&quot;: {
      &quot;value&quot;: 1,
      &quot;relation&quot;: &quot;eq&quot;
    },
    &quot;max_score&quot;: null,
    &quot;hits&quot;: [
      {
        &quot;_index&quot;: &quot;my-index&quot;,
        &quot;_id&quot;: &quot;1&quot;,
        &quot;_score&quot;: null,
        &quot;_ignored&quot;: [&quot;number_one&quot;],
        &quot;_source&quot;: {
          &quot;text&quot;: &quot;Some text value&quot;,
          &quot;number_one&quot;: &quot;foo&quot;
        },
        &quot;fields&quot;: {
          &quot;text&quot;: [&quot;Some text value&quot;],
          &quot;text.keyword&quot;: [&quot;Some text value&quot;]
        },
        &quot;ignored_field_values&quot;: {
          &quot;number_one&quot;: [&quot;foo&quot;]
        },
        &quot;sort&quot;: [&quot;1&quot;]
      }
    ]
  }
}
</code></pre>
<p>From the above JSON response, you will notice some things, such as:</p>
<ul>
<li>A new field called <em><strong>_ignored</strong></em> of type array with the list of all fields that have been ignored while ingesting documents</li>
<li>A new field called <em><strong>ignored_field_values</strong></em> with a dictionary of ignored fields and their values</li>
<li>The field called <em><strong>_source</strong></em> contains the original document unmodified. This is especially useful if you want to fix the problems with the mapping later.</li>
<li>The field called <em><strong>text</strong></em> was not present in the original mapping, but it is now included since Elasticsearch automatically inferred the type of this field. In fact, if you try to query the mapping of the index <em><strong>my-index</strong></em> again via the command:</li>
</ul>
<pre><code class="language-bash">GET my-index/_mapping
</code></pre>
<p>You should get this result:</p>
<pre><code class="language-json">{
  &quot;my-index&quot;: {
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;number_one&quot;: {
          &quot;type&quot;: &quot;integer&quot;,
          &quot;ignore_malformed&quot;: true
        },
        &quot;number_two&quot;: {
          &quot;type&quot;: &quot;integer&quot;
        },
        &quot;text&quot;: {
          &quot;type&quot;: &quot;text&quot;,
          &quot;fields&quot;: {
            &quot;keyword&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 256
            }
          }
        }
      }
    }
  }
}
</code></pre>
<p>Finally, ingest a valid document with the following command:</p>
<pre><code class="language-bash">PUT my-index/_doc/3
{
  &quot;text&quot;:       &quot;Some text value&quot;,
  &quot;number_two&quot;: 10
}
</code></pre>
<p>You can check how many documents have at least one ignored field with the following <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html">Exists query</a>:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;query&quot;: {
    &quot;exists&quot;: {
      &quot;field&quot;: &quot;_ignored&quot;
    }
  }
}
</code></pre>
<p>You can also see that out of the two documents ingested (with <em>id=1</em> and <em>id=3</em>) only the document with <em>id=1</em> contains an ignored field.</p>
<pre><code class="language-json">{
  &quot;took&quot;: 193,
  &quot;timed_out&quot;: false,
  &quot;_shards&quot;: {
    &quot;total&quot;: 1,
    &quot;successful&quot;: 1,
    &quot;skipped&quot;: 0,
    &quot;failed&quot;: 0
  },
  &quot;hits&quot;: {
    &quot;total&quot;: {
      &quot;value&quot;: 1,
      &quot;relation&quot;: &quot;eq&quot;
    },
    &quot;max_score&quot;: 1,
    &quot;hits&quot;: [
      {
        &quot;_index&quot;: &quot;my-index&quot;,
        &quot;_id&quot;: &quot;1&quot;,
        &quot;_score&quot;: 1,
        &quot;_ignored&quot;: [&quot;number_one&quot;],
        &quot;_source&quot;: {
          &quot;text&quot;: &quot;Some text value&quot;,
          &quot;number_one&quot;: &quot;foo&quot;
        }
      }
    ]
  }
}
</code></pre>
<p>Alternatively, you can search for all documents that have a specific field being ignored with this <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html">Terms query</a>:</p>
<pre><code class="language-bash">GET my-index/_search
{
  &quot;query&quot;: {
    &quot;terms&quot;: {
      &quot;_ignored&quot;: [ &quot;number_one&quot;]
    }
  }
}
</code></pre>
<p>The result, in this case, will be the same as the previous one since we only managed to ingest a single document with that exact single field ignored.</p>
<h2>Conclusion</h2>
<p>Because we are big fans of this flag, we've enabled <em><strong>ignore_malformed</strong></em> by default for all Elastic integrations and in the <a href="https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/template-resources/src/main/resources/logs-settings.json#L13">default index template for logs data streams</a> as of 8.9.0. More information can be found in the official documentation for <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.9/ignore-malformed.html">ignore_malformed</a>.</p>
<p>And since I am personally working on this feature, I can assure you that it is a game changer.</p>
<p>You can start by setting <em><strong>ignore_malformed</strong></em> on any cluster manually before Elastic Stack 8.9.0. Or you can use the defaults that we set for you starting from <a href="https://www.elastic.co/blog/whats-new-elastic-enterprise-search-8-9-0">Elastic Stack 8.9.0</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/antidote-index-mapping-exceptions-ignore-malformed/illustration-stack-modernize-solutions-1689x980_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Automated Error Triage: From Reactive to Autonomous]]></title>
            <link>https://www.elastic.co/observability-labs/blog/automated-error-triaging</link>
            <guid isPermaLink="false">automated-error-triaging</guid>
            <pubDate>Wed, 18 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to automate error triage by using Elasticsearch log clustering and AI agents, turning production logs into actionable root cause reports.]]></description>
            <content:encoded><![CDATA[<p>The engineering feedback loop is often pictured as a clean cycle: shipping a feature, monitoring its health, triaging issues, identifying bugs, and deploying fixes. However, in large-scale cloud environments, the path from monitoring to identification frequently becomes a bottleneck. When thousands of Kibana instances running on Elastic Cloud emit millions of logs across a vast codebase, the lag between an error occurring and an engineer understanding its root cause—the Maintenance Gap—can stretch from hours to months.</p>
<p>To close this gap, we built an automated pipeline that moves beyond simple monitoring. By automating the discovery and investigation phases, we have shifted the focus of the engineer from &quot;what happened?&quot; to &quot;is this fix correct?&quot;</p>
<h2><strong>The Bottleneck in the Feedback Loop</strong></h2>
<p>In a high-velocity engineering environment, the path from deployment to resolution involves several distinct stages: <strong>Ship</strong>, <strong>Monitor</strong>, <strong>Triage</strong>, <strong>Identify</strong>, <strong>Fix</strong>, and <strong>Review/Deploy</strong>.</p>
<p>Velocity typically stalls during triage and identification. While catastrophic failures are reported immediately, smaller errors—intermittent UI glitches or failed background tasks—often go unreported. This dependency on manual reporting creates an inflated time to resolution; by the time a report is filed and routed, the issue may have already impacted the fleet for days.</p>
<p>By automating discovery and investigation, even these &quot;paper cut&quot; bugs are quantified before they accumulate into significant technical debt. The goal is to ensure that by the time a developer enters the cycle to write a fix, the detective work is already complete.</p>
<h2><strong>Discovery: Automated Log Clustering</strong></h2>
<p>The first challenge in this process is signal-to-noise. In a massive production environment, creating a ticket for every error event is unmanageable.</p>
<p>Instead of analyzing individual log lines, we automate the triage process using ES|QL's <a href="https://www.elastic.co/docs/reference/query-languages/esql/functions-operators/grouping-functions/categorize">CATEGORIZE grouping function</a>. <code>CATEGORIZE</code> clusters text messages into groups of similarly formatted values, turning unstructured telemetry into a prioritized backlog of distinct error patterns.</p>
<p>For example, a query like the following runs on a rolling window across all Kibana error logs:</p>
<pre><code class="language-esql">FROM kibana-server-logs
| WHERE log.level == &quot;ERROR&quot;
    AND @timestamp &gt;= NOW() - 7 days
| STATS count = COUNT() BY category = CATEGORIZE(message)
| SORT count DESC
</code></pre>
<p>The result is a table of regex-like categories and their occurrence counts:</p>
<table>
<thead>
<tr>
<th>count</th>
<th>category</th>
</tr>
</thead>
<tbody>
<tr>
<td>1,247</td>
<td><code>.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?</code></td>
</tr>
<tr>
<td>812</td>
<td><code>.*?Connection.+?error.*?</code></td>
</tr>
<tr>
<td>3</td>
<td><code>.*?Disconnected.*?</code></td>
</tr>
</tbody>
</table>
<p>A category like <code>TypeError Cannot read properties of undefined reading document</code> with 1,200+ hits over the past week tells us there is a real, recurring defect worth investigating. A category like <code>Connection error</code> spread uniformly across the fleet is more likely infrastructure noise.</p>
<p>The output is used to automatically file prioritized issues in a backlog, each enriched with the category, its regex, the occurrence count, and deep links into the raw telemetry. This automation ensures the feedback loop no longer waits for a user report to trigger an investigation; the discovery is proactive and immediate. These prioritized clusters then serve as the direct input for our autonomous investigation agent.</p>
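<p>Because each category comes with a regex, raw messages can be mapped back to their category, and the counts give a natural priority order. The Python sketch below illustrates that idea with the category patterns from this post; the helper name, the sample messages, and the choice to keep the top two categories are assumptions, not part of the actual pipeline.</p>
<pre><code class="language-python">import re

# (count, category regex) pairs using the patterns discussed in this post.
categories = [
    (812, r'.*?Connection.+?error.*?'),
    (1247, r'.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?'),
    (3, r'.*?Disconnected.*?'),
]


def categorize(message):
    # Return the first category regex that matches the whole message, if any.
    for _count, pattern in categories:
        if re.fullmatch(pattern, message):
            return pattern
    return None


# Highest-volume categories first; keeping the top two is an arbitrary choice.
top = sorted(categories, reverse=True)[:2]
for count, pattern in top:
    print(count, pattern)

sample = 'TypeError: Cannot read properties of undefined (reading \'document\')'
print(categorize(sample))
</code></pre>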
<h2><strong>Investigation: The Automated Detective</strong></h2>
<p>Once an error pattern is identified, the pipeline moves to the identification phase. We deployed an AI agent to run a complete investigation of the issue. Navigating a codebase of Kibana's complexity is a significant time sink; the agent accelerates this by correlating information across the stack using <strong>ES|QL (Elasticsearch Query Language)</strong>.</p>
<h3><strong>Protocol-Driven Investigation</strong></h3>
<p>It is important to distinguish this agent from a traditional automation script. The agent does not follow a hardcoded state machine; instead, it is provided with a protocol that outlines investigation goals and available tools.</p>
<p>The protocol prescribes a phased approach: understand the error, analyze its distribution, correlate with other data sources, find the source, and report. Each phase is described in terms of goals, not commands. The following excerpt shows how the protocol defines the first investigation step:</p>
<pre><code class="language-markdown">### Phase 1: Understand the Error
- Review the pre-extracted error details from the backlog issue
- Check for similar/overlapping error backlog issues (include closed!)
  - the categorization is often imperfect; closed issues may have
    valuable context about fixes
- Query for error overview statistics
- Get sample error messages to understand the actual content
</code></pre>
<p>The agent is also provided with an ES|QL reference guide and a library of query templates. Here is one of the templates for analyzing version distribution (a common first step to determine whether an error is a regression):</p>
<pre><code class="language-esql">FROM logging-*:cluster-kibana-*
| WHERE @timestamp &gt;= NOW() - 4 hours
    AND log.level == &quot;ERROR&quot;
    AND message : &quot;TypeError Cannot read properties&quot;
| STATS
    error_count = COUNT(*),
    deployments = COUNT_DISTINCT(ece.deployment)
  BY `docker.container.labels.org.label-schema.version`
| SORT error_count DESC
</code></pre>
<p>Because the agent has the autonomy to choose which tools to call—and in what order—based on the results of previous queries, it can adapt its strategy to the specific error. It might decide to skip proxy analysis if the telemetry suggests a background task failure, or it might dive deep into git history if ES|QL reveals the bug only exists on a specific version. This flexibility allows it to navigate the nuance of a massive codebase without requiring a pre-defined path for every possible failure mode.</p>
<h3><strong>Lessons Learned: Query Discipline</strong></h3>
<p>Direct LLM access to production clusters requires tactical constraints to manage costs and performance. We codified several requirements into the investigation workflow to ensure efficiency:</p>
<ul>
<li>
<p><strong>Query Budgets</strong>: The agent is restricted to <code>~15-20</code> queries per investigation, forcing it to form a hypothesis before retrieving data.</p>
</li>
<li>
<p><strong>The 4-Hour Rule</strong>: The agent starts with a small time window (the most recent <code>1-4</code> hours) to leverage caches and reduce compute costs.</p>
</li>
<li>
<p><strong>Optimal Operators</strong>: The agent prefers equality filters and the MATCH (:) operator over LIKE or regex, which can make queries <code>50-1000×</code> faster.</p>
</li>
<li>
<p><strong>Fail-Fast Timeouts</strong>: Every query has a strict timeout, requiring the agent to refine its filters rather than retrying expensive operations.</p>
</li>
</ul>
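<p>Constraints like these can be enforced in the tool layer rather than left to the prompt. The following Python sketch shows one way to do that; the class, its defaults, and the <code>execute</code> callback are hypothetical and stand in for whatever client actually runs the ES|QL query.</p>
<pre><code class="language-python"># Hypothetical guard rails around an agent's query tool; names and defaults are assumptions.
class QueryBudgetExceeded(Exception):
    pass


class QueryBudget:
    def __init__(self, max_queries=20, window_hours=4, timeout_seconds=30):
        self.remaining = max_queries
        self.window_hours = window_hours
        self.timeout_seconds = timeout_seconds

    def run(self, execute, query):
        # Refuse to run once the budget is spent, so the agent has to plan its queries.
        if self.remaining == 0:
            raise QueryBudgetExceeded('query budget exhausted; refine the hypothesis')
        self.remaining -= 1
        # The callback is expected to apply the time window and the fail-fast timeout.
        return execute(query, self.window_hours, self.timeout_seconds)


def fake_execute(query, window_hours, timeout_seconds):
    # Stand-in for a real client call; it only echoes what it was asked to do.
    return f'{query} (last {window_hours}h, timeout {timeout_seconds}s)'


budget = QueryBudget(max_queries=2)
print(budget.run(fake_execute, 'count errors by version'))
print(budget.run(fake_execute, 'sample error messages'))
</code></pre>
<p>A third call on this budget raises <code>QueryBudgetExceeded</code>, which gives the agent an explicit signal to stop querying and report what it has found.</p>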
<h2><strong>Source Code Contextualization</strong></h2>
<p>To complete the identification phase, the agent correlates telemetry with the git history and source files. It uses the stack trace and log patterns to narrow its search, parsing through potential code matches faster than a manual search. By identifying the specific line of code producing the error and checking recent PRs, the agent links a production symptom directly to its technical root cause.</p>
<h2><strong>Real-World Case Study: The Streams UI Crash</strong></h2>
<p>The value of this autonomous investigation is best illustrated by the rare edge cases it uncovers. In one instance, the clustering system surfaced a sporadic pattern:</p>
<p><code>.*?TypeError.+?Cannot.+?read.+?properties.+?of.+?undefined.+?reading.+?document.*?</code></p>
<p>A human might have dismissed this as generic telemetry noise, but the agent's investigation revealed a reproducible race condition in the Streams UI:</p>
<ol>
<li>
<p><strong>Quantification</strong>: Using ES|QL, the agent analyzed the error distribution and identified the specific application context (Streams) and the relevant loggers.</p>
</li>
<li>
<p><strong>Code Analysis</strong>: It identified a logic error in <code>processor_outcome_preview.tsx</code>. The code was indexing into an array (<code>originalSamples[currentDoc.index].document</code>) without verifying the element existed.</p>
</li>
<li>
<p><strong>Root Cause</strong>: The agent determined that when a user changed filters while a row was expanded, <code>currentDoc.index</code> became stale before the next render cleared it.</p>
</li>
<li>
<p><strong>Outcome</strong>: The agent provided a suggested fix (guarding the access) and recommended a regression test around filter changes during row expansion.</p>
</li>
</ol>
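<p>The shape of the bug and of the suggested guard can be shown in a few lines. The sketch below is a Python analogue written for illustration only: the real component is TypeScript, and the function and variable names here are hypothetical rather than taken from the Kibana codebase.</p>
<pre><code class="language-python"># Hypothetical Python analogue of the guarded access; the real component is TypeScript.
def preview_document(original_samples, current_index):
    # A stale index, left over from before a filter change, may no longer
    # point at an element of the freshly filtered sample list.
    if current_index not in range(len(original_samples)):
        return None
    return original_samples[current_index].get('document')


samples_before = [{'document': {'id': 'a'}}, {'document': {'id': 'b'}}]
samples_after_filter = [{'document': {'id': 'a'}}]

# Index 1 was valid before the filter change and is stale afterwards.
print(preview_document(samples_before, 1))
print(preview_document(samples_after_filter, 1))
</code></pre>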
<p>This case highlights the <strong>economic scale</strong> of autonomous triage. Sifting through thousands of &quot;noisy&quot; logs to find the few that represent real, fixable UI crashes is a non-starter for senior engineers. Agents process this volume at a fraction of the cost, acting as a high-fidelity filter that ensures human time is only spent on verified, actionable issues.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-error-triaging/error-report.png" alt="Automated error investigation report showing the agent's analysis of a Streams UI crash, including root cause identification and suggested fix" /></p>
<h2><strong>The Future of Engineering Velocity</strong></h2>
<p>Automating triaging and identification is the first step. We are currently layering in the ability to pass these findings to a coding agent for draft Pull Requests. Beyond production errors, we are also investigating <strong>agentic exploratory testing</strong> to stress-test features during the pre-release phase and catch bugs before they ever reach a user.</p>
<p>This autonomous layer is <strong>complementary to, not a replacement for, classic quality gates</strong>. Unit tests, API-level checks, and UI integration tests remain the primary defense. Our approach provides a safety net for the failures that inevitably bypass these gates in a complex environment, ensuring they are addressed with the same rigor as pre-release bugs.</p>
<p>As we move toward a more agent-driven development process, the ability to rapidly validate that changes are safe and to control overall quality is the primary bottleneck for engineering velocity. While code generation itself is becoming a commodity, the &quot;reasoning&quot; required to verify that a change is both correct and safe remains the most critical hurdle. By focusing our automation on the discovery and root-cause analysis of failures, we ensure that our engineering teams can scale their impact without being buried by the operational weight of maintaining quality. The goal is to build a system that can understand, diagnose, and eventually fix itself.</p>
<p>For more information on Elastic and its observability capabilities, check out <a href="https://www.elastic.co/observability">Elastic Observability</a>. You can also <a href="https://cloud.elastic.co">sign up for a free trial</a> to try it out yourself.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/automated-error-triaging/automated-error-triaging.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Automated log parsing in Streams with ML]]></title>
            <link>https://www.elastic.co/observability-labs/blog/automated-log-parsing-ml-streams</link>
            <guid isPermaLink="false">automated-log-parsing-ml-streams</guid>
            <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how a hybrid ML approach achieved 94% log parsing and 91% log partitioning accuracy through automation experiments with log format fingerprinting in Streams.]]></description>
            <content:encoded><![CDATA[<p>In modern observability stacks, ingesting unstructured logs from diverse data providers into platforms like Elasticsearch remains a challenge. Reliance on manually crafted parsing rules creates brittle pipelines, where even minor upstream code updates lead to parsing failures and unindexed data. This fragility is compounded by the scalability challenge: in dynamic microservices environments, the continuous addition of new services turns manual rule maintenance into an operational nightmare.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automated-log-parsing.png" alt="" /></p>
<p>Our goal was to transition to an automated, adaptive approach capable of handling both log parsing (field extraction) and log partitioning (source identification). We hypothesized that Large Language Models <a href="https://www.elastic.co/what-is/large-language-models">(LLMs)</a>, with their inherent understanding of code syntax and semantic patterns, could automate these tasks with minimal human intervention.</p>
<p>We are happy to announce that this feature is already available in <a href="https://www.elastic.co/elasticsearch/streams">Streams</a>!</p>
<h2>Dataset Description</h2>
<p>We chose a <a href="https://github.com/logpai/loghub">Loghub</a> collection of logs for PoC purposes. For our investigation, we selected representative samples from the following key areas:</p>
<ul>
<li>Distributed systems: We used the HDFS (Hadoop Distributed File System) and Spark datasets. These contain a mix of info, debug, and error messages typical of big data platforms.</li>
<li>Server &amp; web applications: Logs from Apache web servers and OpenSSH provided a valuable source of access, error, and security-relevant events. These are critical for monitoring web traffic and detecting potential threats.</li>
<li>Operating systems: We included logs from Linux and Windows. These datasets represent the common, semi-structured system-level events that operations teams encounter daily.</li>
<li>Mobile systems: To ensure our model could handle logs from mobile environments, we included the Android dataset. These logs are often verbose and capture a wide range of application and system-level activities on mobile devices.</li>
<li>Supercomputers: To test performance on high-performance computing (HPC) environments, we incorporated the BGL (Blue Gene/L) dataset, which features highly structured logs with specific domain terminology.</li>
</ul>
<p>A key advantage of the Loghub collection is that the logs are largely unsanitized and unlabeled, mirroring a noisy live production environment with microservice architecture.</p>
<p>Log examples:</p>
<pre><code class="language-text">[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6
[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties
17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task 'attempt_201706092018_0024_m_000083_1138' to hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083
17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed
</code></pre>
<p>In addition, we created a Kubernetes cluster with a typical web application and database setup to collect additional logs from the most common domain.</p>
<p>Example of common log fields: timestamp, log level (INFO, WARN, ERROR), source, message.</p>
<h2>Few-Shot Log Parsing with an LLM</h2>
<p>Our first set of experiments focused on a fundamental question: <strong>Can an LLM reliably identify key fields and generate consistent parsing rules to extract them?</strong></p>
<p>We asked a model to analyse raw log samples and generate log parsing rules in regular expression (regex) and <a href="https://www.elastic.co/docs/explore-analyze/scripting/grok">Grok</a> formats. Our results showed that this approach has a lot of potential, but also significant implementation challenges.</p>
<h3>High Confidence &amp; Context Awareness</h3>
<p>Initial results were promising. The LLM demonstrated a strong ability to generate parsing rules that matched the provided few-shot examples with high confidence. Beyond simple pattern matching, the model showed a capacity for log understanding: it could correctly identify and name the log source (e.g., health tracking app, Nginx web app, Mongo database).</p>
<h3>The &quot;Goldilocks&quot; Dilemma of Input Samples</h3>
<p>Our experiments quickly surfaced a significant lack of robustness caused by extreme <strong>sensitivity to the input sample</strong>. The model's performance fluctuates wildly based on the specific log examples included in the prompt. We observed a log similarity problem: the sample has to be diverse enough, but not too diverse:</p>
<ul>
<li>Too homogeneous (overfitting): If the input logs are too similar, the LLM tends to overspecify. It treats variable data—such as specific Java class names in a stack trace—as static parts of the template. This results in brittle rules that cover a tiny ratio of logs and extract unusable fields.</li>
<li>Too heterogeneous (confusion): Conversely, if the sample contains significant formatting variance—or worse, &quot;trash logs&quot; like progress bars, memory tables, or ASCII art—the model struggles to find a common denominator. It often resorts to generating complex, broken regexes or lazily over-generalizing the entire line into a single message blob field.</li>
</ul>
<h3>The Context Window Constraint</h3>
<p>We also encountered a context window bottleneck. When input logs were long, heterogeneous, or rich in extractable fields, the model's output often deteriorated, becoming &quot;messy&quot; or too long to fit into the output context window. Naturally, chunking helps in this case. By splitting logs using character-based and entity-based delimiters, we could help the model focus on extracting the main fields without being overwhelmed by noise.</p>
<h3>The Consistency &amp; Standardization Gap</h3>
<p>Even when the model successfully generated rules, we noted slight inconsistencies:</p>
<ul>
<li>Service naming variations: The model proposes different names for the same entity (e.g., labeling the source as &quot;Spark,&quot; &quot;Apache Spark,&quot; and &quot;Spark Log Analytics&quot; in different runs).</li>
<li>Field naming variations: Field names lacked standardization (e.g., <code>id</code> vs. <code>service.id</code> vs. <code>device.id</code>). We normalized names using a standardized <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">Elastic field naming</a>.</li>
<li>Resolution variance: The resolution of the field extraction varied depending on how similar the input logs were to one another.</li>
</ul>
<h2>Log Format Fingerprint</h2>
<p>To address the challenge of log similarity, we introduce a high-performance heuristic: <strong>log format fingerprint (LFF)</strong>.</p>
<p>Instead of feeding raw, noisy logs directly into an LLM, we first apply a deterministic transformation to reveal the underlying structure of each message. This pre-processing step abstracts away variable data, generating a simplified &quot;fingerprint&quot; that allows us to group related logs.</p>
<p>The mapping logic is simple to ensure speed and consistency:</p>
<ol>
<li>Digit abstraction: Any sequence of digits (0-9) is replaced by a single ‘0’.</li>
<li>Text abstraction: Any sequence of alphabetical characters with whitespace is replaced by a single ‘a’.</li>
<li>Whitespace normalization: All sequences of whitespace (spaces, tabs, newlines) are collapsed into a single space.</li>
<li>Symbol preservation: Punctuation and special characters (e.g., :, [, ], /) are preserved, as they are often the strongest indicators of log structure.</li>
</ol>
<p>Let's look at an example of how this mapping allows us to transform the logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/transform-logs.png" alt="" /></p>
<p>As a result, we obtain the following log masks:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/log-masks.png" alt="" /></p>
<p>Notice the fingerprints of the first two logs. Despite different timestamps, source classes, and message content, their prefixes (<code>0/0/0 0:0:0 a a.a:</code>) are identical. This structural alignment allows us to automatically bucket these logs into the same cluster.</p>
<p>The third log, however, produces a completely divergent fingerprint (<code>0-0-0...</code>). This allows us to algorithmically separate it from the first group before we ever invoke an LLM.</p>
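<p>The four mapping rules can be sketched in a few lines of Python. This is a minimal illustration rather than the production implementation: the function name <code>log_format_fingerprint</code> is hypothetical, and the rule that text with spaces collapses into a single <code>a</code> is applied literally, so the exact masks may differ slightly from the ones shown in the figures.</p>

```python
import re


def log_format_fingerprint(line: str) -> str:
    """Map a raw log line to a structural fingerprint (illustrative sketch)."""
    s = re.sub(r"[ \t\n]+", " ", line)  # whitespace normalization
    s = re.sub(r"[A-Za-z]+", "a", s)    # every alphabetical run becomes 'a'
    s = re.sub(r"[0-9]+", "0", s)       # digit abstraction
    s = re.sub(r"a( a)+", "a", s)       # words separated by spaces collapse to one 'a'
    return s                            # punctuation and symbols are left untouched


spark_1 = ("17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: "
           "attempt_201706092018_0024_m_000083_1138: Committed")
spark_2 = ("17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task "
           "'attempt_201706092018_0024_m_000083_1138' to "
           "hdfs://10.10.34.11:9000/pjhe/test/1/_temporary/0/task_201706092018_0024_m_000083")
apache = ("[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok "
          "/etc/httpd/conf/workers2.properties")

for raw in (spark_1, spark_2, apache):
    print(log_format_fingerprint(raw))
```

<p>With this sketch, the two Spark lines share the same leading characters in their fingerprints, while the Apache line starts with a bracketed timestamp mask and lands in a different group.</p>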
<h2>Bonus Part: Instant Implementation with ES|QL</h2>
<p>It’s as easy as passing this query in Discover.</p>
<pre><code class="language-esql">FROM loghub |
EVAL pattern = REPLACE(REPLACE(REPLACE(REPLACE(raw_message, &quot;[ \t\n]+&quot;, &quot; &quot;), &quot;[A-Za-z]+&quot;, &quot;a&quot;), &quot;[0-9]+&quot;, &quot;0&quot;), &quot;a( a)+&quot;, &quot;a&quot;) |
STATS total_count = COUNT(), ratio = COUNT() / 2000.0, datasources=VALUES(filename), example=TOP(raw_message, 3, &quot;desc&quot;) BY SUBSTRING(pattern, 0, 15) |
SORT total_count DESC |
LIMIT 100
</code></pre>
<p><strong>Query breakdown:</strong></p>
<p><strong>FROM</strong> loghub: Targets our index containing the raw log data.</p>
<p><strong>EVAL</strong> pattern = …: The core mapping logic. We chain REPLACE functions to perform the abstraction (e.g., digits to '0', text to 'a', etc.) and save the result in a “pattern” field.</p>
<p><strong>STATS</strong> [column1 =] expression1, … <strong>BY</strong> SUBSTRING(pattern, 0, 15):</p>
<p>This is a clustering step. We group logs that share the first 15 characters of their pattern and create aggregated fields such as the total log count per group, the list of log data sources, the pattern prefix, and 3 log examples.</p>
<p><strong>SORT</strong> total_count DESC | <strong>LIMIT</strong> 100 : Surfaces the top 100 most frequent log patterns</p>
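<p>For readers who want to try the same grouping outside Elasticsearch, here is a rough Python equivalent of the query. It is only a sketch under a few assumptions: the helper names <code>fingerprint</code> and <code>cluster_by_prefix</code> are hypothetical, the ratio is computed against the number of input lines rather than a fixed 2000, and the examples are simply the three lexicographically largest messages per group.</p>

```python
import re
from collections import defaultdict


def fingerprint(line: str) -> str:
    """Same chain of replacements as the EVAL step of the query."""
    s = re.sub(r"[ \t\n]+", " ", line)
    s = re.sub(r"[A-Za-z]+", "a", s)
    s = re.sub(r"[0-9]+", "0", s)
    return re.sub(r"a( a)+", "a", s)


def cluster_by_prefix(lines, prefix_len=15, max_examples=3, limit=100):
    """Group lines by fingerprint prefix and return the largest groups first."""
    groups = defaultdict(list)
    for line in lines:
        groups[fingerprint(line)[:prefix_len]].append(line)

    rows = [
        {
            "prefix": prefix,
            "total_count": len(members),
            "ratio": len(members) / len(lines),
            "examples": sorted(members, reverse=True)[:max_examples],
        }
        for prefix, members in groups.items()
    ]
    rows.sort(key=lambda row: row["total_count"], reverse=True)  # SORT ... DESC
    return rows[:limit]                                          # LIMIT 100


sample_logs = [
    "[Sun Dec 04 20:34:21 2005] [notice] jk2_init() Found child 2008 in scoreboard slot 6",
    "[Sun Dec 04 20:34:25 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties",
    "[Mon Dec 05 11:06:51 2005] [notice] workerEnv.init() ok /etc/httpd/conf/workers2.properties",
    "17/06/09 20:10:58 INFO output.FileOutputCommitter: Saved output of task",
    "17/06/09 20:10:58 INFO mapred.SparkHadoopMapRedUtil: attempt_201706092018_0024_m_000083_1138: Committed",
]

rows = cluster_by_prefix(sample_logs)
for row in rows:
    print(row["prefix"], row["total_count"], row["ratio"])
```

<p>On these five sample lines the three Apache messages fall into one group and the two Spark messages into another, which is the behaviour the prefix-based clustering step relies on.</p>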
<p>The query results on LogHub are displayed below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/query-results.png" alt="" />
<img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/results.png" alt="" /></p>
<p>As demonstrated in the visualization, this “LLM-free” approach partitions logs with high accuracy. It successfully clustered 10 out of 16 data sources (based on LogHub labels) completely (&gt;90%) and achieved majority clustering in 13 out of 16 sources (&gt;60%), all without requiring additional cleaning, preprocessing, or fine-tuning.</p>
<p>Log format fingerprint offers a pragmatic, high-impact alternative and addition to sophisticated ML solutions like <a href="https://www.elastic.co/docs/reference/aggregations/search-aggregations-bucket-categorize-text-aggregation">log pattern analysis</a>. It provides immediate insights into log relationships and effectively manages large log clusters.</p>
<ul>
<li>Versatility as a primitive</li>
</ul>
<p>Thanks to its <a href="https://www.elastic.co/blog/getting-started-elasticsearch-query-language">ES|QL</a> implementation, LFF serves both as a standalone tool for fast data diagnostics and visualisations, and as a building block in log analysis pipelines for high-volume use cases.</p>
<ul>
<li>Flexibility</li>
</ul>
<p>LFF is easy to customize and extend to capture specific patterns, e.g., hexadecimal numbers and IP addresses.</p>
<ul>
<li>Deterministic stability</li>
</ul>
<p>Unlike ML-based clustering algorithms, LFF logic is straightforward and deterministic. New incoming logs do not retroactively affect existing log clusters.</p>
<ul>
<li>Performance and Memory</li>
</ul>
<p>It requires minimal memory and no training or GPU, making it ideal for real-time, high-throughput environments.</p>
<h2>Combining Log Format Fingerprint with an LLM</h2>
<p>To validate the proposed hybrid architecture, each experiment contained a random 20% subset of the logs from each data source. This constraint simulates a real-world production environment where logs are processed in batches rather than as a monolithic historical dump.</p>
<p>The objective was to demonstrate that LFF acts as an effective compression layer. We aimed to prove that high-coverage parsing rules could be generated from small, curated samples and successfully generalized to the entire dataset.</p>
<h2>Execution Pipeline</h2>
<p>We implemented a multi-stage pipeline that filters, clusters, and applies stratified sampling to the data before it reaches the LLM.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automating-log-parsing-ai-excecusion-pipeline.png" alt="" /></p>
<ol>
<li>Two-stage hierarchical clustering</li>
</ol>
<ul>
<li>Subclasses (exact match): Logs are aggregated by identical fingerprints. Every log in one subclass shares the exact same format structure.</li>
<li>Outlier cleaning: We discard any subclasses that represent less than 5% of the total log volume. This ensures the LLM focuses on the dominant signal and won’t be sidetracked by noise or malformed logs.</li>
<li>Metaclasses (prefix match): Remaining subclasses are grouped into metaclasses when the first N characters of their format fingerprints match. This grouping strategy effectively brings lexically similar formats under a single umbrella. We chose N=5 for log parsing and N=15 for log partitioning when data sources are unknown.</li>
</ul>
<ol start="2">
<li>Stratified sampling. Once the hierarchical tree is built, we construct the log sample for the LLM. The strategic goal is to maximize variance coverage while minimizing token usage.</li>
</ol>
<ul>
<li>We select representative logs from each valid subclass within the broader metaclass.</li>
<li>To handle the edge case of too many subclasses, we apply random down-sampling to fit the target window size.</li>
</ul>
<ol start="3">
<li>Rule generation. Finally, we prompt the LLM to generate, for each metaclass, a regex parsing rule that fits all logs in the provided sample. For our PoC, we used the GPT-4o mini model.</li>
</ol>
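<p>The first two stages of the pipeline can be sketched as follows. This is an illustrative approximation, not the code used in the experiments: the function name <code>build_llm_samples</code> and the <code>per_subclass</code> and <code>max_sample</code> defaults are assumptions, while the 5% outlier threshold and the prefix length of 5 come from the description above. The final rule-generation step (the LLM prompt) is deliberately left out.</p>

```python
import random
import re
from collections import defaultdict


def fingerprint(line: str) -> str:
    """Log format fingerprint: digits to '0', text to 'a', symbols preserved."""
    s = re.sub(r"[ \t\n]+", " ", line)
    s = re.sub(r"[A-Za-z]+", "a", s)
    s = re.sub(r"[0-9]+", "0", s)
    return re.sub(r"a( a)+", "a", s)


def build_llm_samples(lines, prefix_len=5, min_share=0.05,
                      per_subclass=2, max_sample=30, seed=0):
    """Return one small, diverse log sample per metaclass."""
    rng = random.Random(seed)

    # Subclasses: logs with exactly the same fingerprint.
    subclasses = defaultdict(list)
    for line in lines:
        subclasses[fingerprint(line)].append(line)

    # Outlier cleaning: drop subclasses below the share threshold.
    kept = {fp: members for fp, members in subclasses.items()
            if len(members) / len(lines) >= min_share}

    # Metaclasses: subclasses whose fingerprints share the first N characters.
    metaclasses = defaultdict(dict)
    for fp, members in kept.items():
        metaclasses[fp[:prefix_len]][fp] = members

    # Stratified sampling: a few logs from every subclass, then random
    # down-sampling if the combined sample exceeds the target size.
    samples = {}
    for prefix, subs in metaclasses.items():
        sample = []
        for members in subs.values():
            sample.extend(rng.sample(members, min(per_subclass, len(members))))
        if len(sample) > max_sample:
            sample = rng.sample(sample, max_sample)
        samples[prefix] = sample
    return samples


logs = (
    [f"17/06/09 20:10:{i:02d} INFO storage.MemoryStore: Block broadcast_{i} stored"
     for i in range(10)]
    + [f"[Sun Dec 04 20:34:{i:02d} 2005] [notice] workerEnv.init() ok"
       for i in range(10)]
    + ["#### progress 50% ####"]  # noise line, below the 5% threshold
)
samples = build_llm_samples(logs)
for prefix, sample in samples.items():
    print(repr(prefix), sample)
```

<p>Each value in <code>samples</code> is the small log sample that would be placed in the prompt for one metaclass; the noise line never reaches the LLM because its subclass is discarded during outlier cleaning.</p>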
<h2>Experimental Results &amp; Observations</h2>
<p>We achieved 94% parsing accuracy and 91% partitioning accuracy on the Loghub dataset.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/automating-log-parsing-results.png" alt="" /></p>
<p>The confusion matrix above illustrates log partitioning results. The vertical axis represents the actual data sources, and the horizontal axis represents the predicted data sources. The heatmap intensity corresponds to log volume, with lighter tiles indicating a higher count. The diagonal alignment demonstrates the model's high fidelity in source attribution, with minimal scattering.</p>
<h2>Performance Benchmark Insights</h2>
<ul>
<li>Optimal baseline: a context window of 30–40 log samples per category proved to be the &quot;sweet spot,&quot; consistently producing robust parsing with both Regex and Grok patterns.</li>
<li>Input minimisation: we reduced the input size to 10 logs per category for Regex patterns and observed only a 2% drop in parsing performance, confirming that diversity-based sampling is more critical than raw volume.</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/automated-log-parsing-ml-streams/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Unleash the power of Elastic and Amazon Kinesis Data Firehose to enhance observability and data analytics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics</link>
            <guid isPermaLink="false">aws-kinesis-data-firehose-observability-analytics</guid>
            <pubDate>Thu, 18 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to directly ingest logs into Elastic Cloud in real time for centralized alerting, troubleshooting, and analytics across your cloud and on-premises infrastructure.]]></description>
            <content:encoded><![CDATA[<p>As more organizations leverage the Amazon Web Services (AWS) cloud platform and services to drive operational efficiency and bring products to market, managing logs becomes a critical component of maintaining visibility and safeguarding multi-account AWS environments. Traditionally, logs are stored in Amazon Simple Storage Service (Amazon S3) and then shipped to an external monitoring and analysis solution for further processing.</p>
<p>To simplify this process and reduce management overhead, AWS users can now leverage the new Amazon Kinesis Firehose Delivery Stream to ingest logs into Elastic Cloud in AWS in real time and view them in the Elastic Stack alongside other logs for centralized analytics. This eliminates the necessity for time-consuming and expensive procedures such as VM provisioning or data shipper operations.</p>
<p>Elastic Observability unifies logs, metrics, and application performance monitoring (APM) traces for a full contextual view across your hybrid <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">AWS environments alongside their on-premises data sets</a>. Elastic Observability enables you to track and monitor performance <a href="https://www.elastic.co/observability/aws-monitoring">across a broad range of AWS services</a>, including AWS Lambda, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Amazon Simple Storage Service (S3), Amazon Cloudtrail, Amazon Network Firewall, and more.</p>
<p>In this blog, we will walk you through how to use the Amazon Kinesis Data Firehose integration — <a href="https://aws.amazon.com/blogs/big-data/accelerate-data-insights-with-elastic-and-amazon-kinesis-data-firehose/">Elastic is listed in the Amazon Kinesis Firehose</a> drop-down list — to simplify your architecture and send logs to Elastic, so you can monitor and safeguard your multi-account AWS environments.</p>
<h2>Announcing the Kinesis Firehose method</h2>
<p>Elastic currently provides both agent-based and serverless mechanisms, and we are pleased to announce the addition of the Kinesis Firehose method. This new method enables customers to directly ingest logs from AWS into Elastic, supplementing our existing options.</p>
<ul>
<li><a href="https://www.youtube.com/watch?v=pnGXjljuEnY"><strong>Elastic Agent</strong></a> pulls metrics and logs from CloudWatch and S3 where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53) and ingests them into Elastic Cloud.</li>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3"><strong>Elastic’s Serverless Forwarder</strong></a> (runs on AWS Lambda and is available in the AWS SAR) sends logs from Kinesis Data Stream, Amazon S3, and Amazon CloudWatch log groups into Elastic. To learn more about this topic, please see this <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">blog post</a>.</li>
<li><a href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html"><strong>Amazon Kinesis Firehose</strong></a> directly ingests logs from AWS into Elastic (specifically, if you are running the Elastic Cloud on AWS).</li>
</ul>
<p>In this blog, we will cover the last option since we have recently released the Amazon Kinesis Data Firehose integration. Specifically, we'll review:</p>
<ul>
<li>A general overview of the Amazon Kinesis Data Firehose integration and how it works with AWS</li>
<li>Step-by-step instructions to set up the Amazon Kinesis Data Firehose integration on AWS and on <a href="http://cloud.elastic.co">Elastic Cloud</a></li>
</ul>
<p>By the end of this blog, you'll be equipped with the knowledge and tools to simplify your AWS log management with Elastic Observability and Amazon Kinesis Data Firehose.</p>
<h2>Prerequisites and configurations</h2>
<p>If you intend to follow the steps outlined in this blog post, there are a few prerequisites and configurations that you should have in place beforehand.</p>
<ol>
<li>You will need an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack on AWS. Instructions for deploying a stack on AWS can be found <a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">here</a>. This is necessary for AWS Firehose Log ingestion.</li>
<li>You will also need an AWS account with the necessary permissions to pull data from AWS. Details on the required permissions can be found in our <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">documentation</a>.</li>
<li>Finally, be sure to turn on VPC Flow Logs for the VPC where your application is deployed and send them to AWS Firehose.</li>
</ol>
<h2>Elastic’s Amazon Kinesis Data Firehose integration</h2>
<p>Elastic has collaborated with AWS to offer a seamless integration of Amazon Kinesis Data Firehose with Elastic, enabling direct ingestion of data from Amazon Kinesis Data Firehose into Elastic without the need for Agents or Beats. All you need to do is configure the Amazon Kinesis Data Firehose delivery stream to send its data to Elastic's endpoint. In this configuration, we will demonstrate how to ingest VPC Flow logs and Firewall logs into Elastic. You can follow a similar process to ingest other logs from your AWS environment into Elastic.</p>
<p>There are three distinct configurations available for ingesting VPC Flow and Network Firewall logs into Elastic: one sends logs through CloudWatch, another through S3, and the third through Kinesis Firehose. Each has its own setup. With CloudWatch and S3 you can store and forward logs, whereas with Kinesis Firehose logs are ingested immediately. In this blog post, we will focus on the new configuration, which sends VPC Flow logs and Network Firewall logs directly to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/image2.png" alt="AWS elastic configuration" /></p>
<p>We will guide you through the configuration of the easiest setup, which involves directly sending VPC Flow logs and Firewalls logs to Amazon Kinesis Data Firehose and then into Elastic Cloud.</p>
<p><strong>Note:</strong> This setup is only compatible with Elastic Cloud on AWS. It cannot be used with self-managed, on-premises, or other cloud provider Elastic deployments.</p>
<h2>Setting it all up</h2>
<p>To begin setting up the integration between Amazon Kinesis Data Firehose and Elastic, let's go through the necessary steps.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Create an account on Elastic Cloud by following the instructions provided to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/Screenshot_2023-05-18_at_6.00.28_PM.png" alt="elastic free trial" /></p>
<h3>Step 1: Deploy Elastic on AWS</h3>
<p>You can deploy Elastic on AWS via two different approaches: through the UI or through Terraform. We’ll start first with the UI option.</p>
<p>After logging into Elastic Cloud, create a deployment on Elastic. It's crucial to make sure that the deployment is on Elastic Cloud on AWS since the Amazon Kinesis Data Firehose connects to a specific endpoint that must be on AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-create-a-deployment.png" alt="create a deployment" /></p>
<p>After your deployment is created, it's essential to copy the Elasticsearch endpoint to ensure a seamless configuration process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-O11y-log.png" alt="O11y log" /></p>
<p>Copy the Elasticsearch HTTP endpoint; you will need it when configuring the Amazon Kinesis Data Firehose destination. Here's an example of what the endpoint should look like:</p>
<pre><code class="language-bash">https://elastic-O11y-log.es.us-east-1.aws.found.io
</code></pre>
<h3><em>Alternative approach using Terraform</em></h3>
<p>An alternative approach to deploying Elastic Cloud on AWS is by using Terraform. It's also an effective way to automate and streamline the deployment process.</p>
<p>To begin, simply create a Terraform configuration file that outlines the necessary infrastructure. This file should include resources for your Elastic Cloud deployment and any required IAM roles and policies. By using this approach, you can simplify the deployment process and ensure consistency across environments.</p>
<p>One easy way to create your Elastic Cloud deployment with Terraform is to use this GitHub <a href="https://github.com/aws-ia/terraform-elastic-cloud">repo</a>. This resource lets you specify the region, version, and deployment template for your Elastic Cloud deployment, as well as any additional settings you require.</p>
<h3>Step 2: To turn on Elastic's AWS integrations, navigate to the Elastic Integration section in your deployment</h3>
<p>To install AWS assets in your deployment's Elastic Integration section, follow these steps:</p>
<ol>
<li>Log in to your Elastic Cloud deployment and open <strong>Kibana</strong>.</li>
<li>To get started, go to the <strong>management</strong> section of Kibana and click on <strong>Integrations</strong>.</li>
<li>Navigate to the <strong>AWS</strong> integration and click on the &quot;Install AWS Assets&quot; button in the <strong>settings</strong>. This step is important as it installs the necessary assets such as <strong>dashboards</strong> and <strong>ingest pipelines</strong> to enable data ingestion from AWS services into Elastic.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-aws-settings.png" alt="aws settings" /></p>
<h3>Step 3: Set up the Amazon Kinesis Data Firehose delivery stream on the AWS Console</h3>
<p>You can set up the Kinesis Data Firehose delivery stream via two different approaches: through the AWS Management Console or through Terraform. We’ll start first with the console option.</p>
<p>To set up the Kinesis Data Firehose delivery stream on AWS, follow these <a href="https://docs.aws.amazon.com/firehose/latest/dev/create-destination.html#create-destination-elastic">steps</a>:</p>
<ol>
<li>
<p>Go to the AWS Management Console and select Amazon Kinesis Data Firehose.</p>
</li>
<li>
<p>Click on Create delivery stream.</p>
</li>
<li>
<p>Choose a delivery stream name and select Direct PUT or other sources as the source.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-create-delivery-stream.png" alt="create delivery stream" /></p>
<ol start="4">
<li>
<p>Choose Elastic as the destination.</p>
</li>
<li>
<p>In the Elastic destination section, enter the Elastic endpoint URL that you copied from your Elastic Cloud deployment.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-destination-settings.png" alt="destination settings" /></p>
<ol start="6">
<li>
<p>Choose the content encoding and retry duration as shown above.</p>
</li>
<li>
<p>Enter the appropriate parameter values for your AWS log type. For example, for VPC Flow logs, you would set the <strong>es_datastream_name</strong> parameter to <strong>logs-aws.vpcflow-default</strong>.</p>
</li>
<li>
<p>Configure an Amazon S3 bucket as the backup for the Amazon Kinesis Data Firehose delivery stream (for failed data only or for all data), and configure any required tags for the delivery stream.</p>
</li>
<li>
<p>Review the settings and click on Create delivery stream.</p>
</li>
</ol>
<p>In the example above, we are using the <strong>es_datastream_name</strong> parameter to pull in VPC Flow logs through the <strong>logs-aws.vpcflow-default</strong> datastream. Depending on your use case, this parameter can be configured with one of the following types of logs:</p>
<ul>
<li>logs-aws.cloudfront_logs-default (AWS CloudFront logs)</li>
<li>logs-aws.ec2_logs-default (EC2 logs in AWS CloudWatch)</li>
<li>logs-aws.elb_logs-default (Amazon Elastic Load Balancing logs)</li>
<li>logs-aws.firewall_logs-default (AWS Network Firewall logs)</li>
<li>logs-aws.route53_public_logs-default (Amazon Route 53 public DNS queries logs)</li>
<li>logs-aws.route53_resolver_logs-default (Amazon Route 53 DNS queries &amp; responses logs)</li>
<li>logs-aws.s3access-default (Amazon S3 server access log)</li>
<li>logs-aws.vpcflow-default (AWS VPC flow logs)</li>
<li>logs-aws.waf-default (AWS WAF Logs)</li>
</ul>
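<p>If you create delivery streams from scripts, the list above can be kept as a small lookup table. The sketch below is an assumption-laden illustration: the short keys (such as <code>vpcflow</code>) and the helper name <code>firehose_common_attributes</code> are hypothetical, and the returned structure follows the <code>CommonAttributes</code> shape of the Firehose HTTP endpoint request configuration as we understand it. No AWS call is made.</p>

```python
# Data stream names taken from the list above; the short keys are hypothetical labels.
ES_DATASTREAMS = {
    "cloudfront": "logs-aws.cloudfront_logs-default",
    "ec2": "logs-aws.ec2_logs-default",
    "elb": "logs-aws.elb_logs-default",
    "firewall": "logs-aws.firewall_logs-default",
    "route53_public": "logs-aws.route53_public_logs-default",
    "route53_resolver": "logs-aws.route53_resolver_logs-default",
    "s3access": "logs-aws.s3access-default",
    "vpcflow": "logs-aws.vpcflow-default",
    "waf": "logs-aws.waf-default",
}


def firehose_common_attributes(log_type: str) -> list:
    """Build the es_datastream_name attribute for a given AWS log type."""
    if log_type not in ES_DATASTREAMS:
        known = ", ".join(sorted(ES_DATASTREAMS))
        raise ValueError(f"Unknown log type {log_type!r}; expected one of: {known}")
    return [
        {
            "AttributeName": "es_datastream_name",
            "AttributeValue": ES_DATASTREAMS[log_type],
        }
    ]


print(firehose_common_attributes("vpcflow"))
```

<p>Rejecting unknown log types early avoids silently sending data to a data stream that has no matching ingest pipeline.</p>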
<h3><em>Alternative approach using Terraform</em></h3>
<p>Using the <strong>aws_kinesis_firehose_delivery_stream</strong> resource in <strong>Terraform</strong> is another way to create a Kinesis Firehose delivery stream, allowing you to specify the delivery stream name, data source, and destination (in this case, an Elasticsearch HTTP endpoint). To authenticate, you'll need to provide the endpoint URL and an API key. Leveraging this Terraform resource is a good way to automate and streamline your deployment process, resulting in greater consistency and efficiency.</p>
<p>Here's example code that shows how to create a Kinesis Firehose delivery stream with Terraform that sends data to an Elasticsearch HTTP endpoint:</p>
<pre><code class="language-hcl">resource &quot;aws_kinesis_firehose_delivery_stream&quot; &quot;Elasticcloud_stream&quot; {
  name        = &quot;terraform-kinesis-firehose-ElasticCloud-stream&quot;
  destination = &quot;http_endpoint&quot;

  # S3 bucket used as the backup location for the delivery stream
  s3_configuration {
    role_arn           = aws_iam_role.firehose.arn
    bucket_arn         = aws_s3_bucket.bucket.arn
    buffer_size        = 5
    buffer_interval    = 300
    compression_format = &quot;GZIP&quot;
  }

  http_endpoint_configuration {
    # Placeholders: use the Elasticsearch endpoint copied in Step 1 and your own API key
    url        = &quot;https://cloud.elastic.co/&quot;
    name       = &quot;ElasticCloudEndpoint&quot;
    access_key = &quot;ElasticApi-key&quot;

    buffering_hints {
      size_in_mb          = 5
      interval_in_seconds = 300
    }

    role_arn       = &quot;arn:Elastic_role&quot; # placeholder IAM role ARN
    s3_backup_mode = &quot;FailedDataOnly&quot;   # back up only records that fail delivery
  }
}
</code></pre>
<h3>Step 4: Configure VPC Flow Logs to send to Amazon Kinesis Data Firehose</h3>
<p>To complete the setup, you'll need to configure VPC Flow logs in the VPC where your application is deployed and send them to the Amazon Kinesis Data Firehose delivery stream you set up in Step 3.</p>
<p>Enabling VPC flow logs in AWS takes only a few steps. Here is how to enable them in your AWS account:</p>
<ol>
<li>
<p>Select the VPC for which you want to enable flow logs.</p>
</li>
<li>
<p>In the VPC dashboard, click on &quot;Flow Logs&quot; under the &quot;Logs&quot; section.</p>
</li>
<li>
<p>Click on the &quot;Create Flow Log&quot; button to create a new flow log.</p>
</li>
<li>
<p>In the &quot;Create Flow Log&quot; wizard, provide the following information:</p>
</li>
</ol>
<p>Choose the target for your flow logs: In this case, Amazon Kinesis Data Firehose in the same AWS account.</p>
<ul>
<li>Provide a name for your flow log.</li>
<li>Choose the VPC and the network interface(s) for which you want to enable flow logs.</li>
<li>Choose the flow log format: either AWS default or Custom format.</li>
</ul>
<ol start="5">
<li>
<p>Configure the IAM role for the flow logs. If you have an existing IAM role, select it. Otherwise, create a new IAM role that grants the necessary permissions for the flow logs.</p>
</li>
<li>
<p>Review the flow log configuration and click &quot;Create.&quot;</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-flow-log-settings.png" alt="flow log settings" /></p>
<p>Create the VPC Flow log.</p>
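<p>The console steps above can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its EC2 <em>create_flow_logs</em> call; the helper name, the VPC ID, and the delivery stream ARN are placeholders for illustration, not values from this walkthrough.</p>
<pre><code class="language-python">def build_flow_log_request(vpc_id, firehose_arn):
    '''Build keyword arguments for the EC2 create_flow_logs call.'''
    return {
        'ResourceIds': [vpc_id],
        'ResourceType': 'VPC',
        'TrafficType': 'ALL',  # capture both ACCEPT and REJECT records
        'LogDestinationType': 'kinesis-data-firehose',
        'LogDestination': firehose_arn,
    }


# Placeholder identifiers; replace them with your own VPC and delivery stream.
params = build_flow_log_request(
    'vpc-0123456789abcdef0',
    'arn:aws:firehose:us-east-1:111122223333:deliverystream/my-delivery-stream',
)
print(params['LogDestinationType'])
</code></pre>
<p>With AWS credentials configured, the resulting dictionary could be passed as <em>boto3.client('ec2').create_flow_logs(**params)</em>. Keeping the request in a plain dictionary makes it easy to review before anything is created.</p>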
<h3>Step 5: After a few minutes, check if flows are coming into Elastic</h3>
<p>To confirm that the VPC Flow logs are being ingested into Elastic, check the logs in Kibana. Open the Discover tab and filter the results by the appropriate index and time range. If VPC Flow logs are flowing in, you should see a list of documents representing them.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-expanded-document.png" alt="expanded document" /></p>
<h3>Step 6: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] VPC Flow Log Overview dashboard</h3>
<p>Finally, there is an Elastic out-of-the-box (OOTB) VPC Flow logs dashboard that displays the top IP addresses that are hitting your VPC, their geographic location, time series of the flows, and a summary of VPC flow log rejects within the selected time frame. This dashboard can provide valuable insights into your network traffic and potential security threats.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-VPC-flow-log-map.png" alt="vpc flow log map" /></p>
<p><em>Note: For additional VPC flow log analysis capabilities, please refer to</em> <a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability"><em>this blog</em></a><em>.</em></p>
<h3>Step 7: Configure AWS Network Firewall Logs to send to Kinesis Firehose</h3>
<p>To create a Kinesis Data Firehose delivery stream for AWS Network Firewall logs, first log in to the AWS Management Console, navigate to the Kinesis service, select &quot;Data Firehose&quot;, and follow the step-by-step instructions as shown in Step 3. Specify the Elasticsearch endpoint and API key, add a parameter (<em><strong>es_datastream_name=logs-aws.firewall_logs-default</strong></em>), and create the delivery stream.</p>
<p>Second, to set up a Network Firewall rule group to send logs to the Kinesis Firehose, go to the Network Firewall section of the console, create a rule group, add a rule to allow traffic to the Kinesis endpoint, and attach the rule group to your Network Firewall configuration. Finally, test the configuration by sending traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.</p>
<p>Kindly follow the instructions below to set up a firewall rule and logging.</p>
<ol>
<li>Set up a Network Firewall rule group to send logs to Amazon Kinesis Data Firehose:</li>
</ol>
<ul>
<li>Go to the AWS Management Console and select Network Firewall.</li>
<li>Click on &quot;Rule groups&quot; in the left menu and then click &quot;Create rule group.&quot;</li>
<li>Choose &quot;Stateless&quot; or &quot;Stateful&quot; depending on your requirements, and give your rule group a name. Click &quot;Create rule group.&quot;</li>
<li>Add a rule to the rule group to allow traffic to the Kinesis Firehose endpoint. For example, if you are using the us-east-1 region, you would add a rule like this:</li>
</ul>
<pre><code class="language-json">{
  &quot;RuleDefinition&quot;: {
    &quot;Actions&quot;: [
      {
        &quot;Type&quot;: &quot;AWS::KinesisFirehose::DeliveryStream&quot;,
        &quot;Options&quot;: {
          &quot;DeliveryStreamArn&quot;: &quot;arn:aws:firehose:us-east-1:12387389012:deliverystream/my-delivery-stream&quot;
        }
      }
    ],
    &quot;MatchAttributes&quot;: {
      &quot;Destination&quot;: {
        &quot;Addresses&quot;: [&quot;api.firehose.us-east-1.amazonaws.com&quot;]
      },
      &quot;Protocol&quot;: {
        &quot;Numeric&quot;: 6,
        &quot;Type&quot;: &quot;TCP&quot;
      },
      &quot;PortRanges&quot;: [
        {
          &quot;From&quot;: 443,
          &quot;To&quot;: 443
        }
      ]
    }
  },
  &quot;RuleOptions&quot;: {
    &quot;CustomTCPStarter&quot;: {
      &quot;Enabled&quot;: true,
      &quot;PortNumber&quot;: 443
    }
  }
}
</code></pre>
<ul>
<li>Save the rule group.</li>
</ul>
<ol start="2">
<li>Attach the rule group to your Network Firewall configuration:</li>
</ol>
<ul>
<li>Go to the AWS Management Console and select Network Firewall.</li>
<li>Click on &quot;Firewall configurations&quot; in the left menu and select the configuration you want to attach the rule group to.</li>
<li>Scroll down to &quot;Associations&quot; and click &quot;Edit.&quot;</li>
<li>Select the rule group you created in step 1 and click &quot;Save.&quot;</li>
</ul>
<ol start="3">
<li>Test the configuration:</li>
</ol>
<ul>
<li>Send traffic through the Network Firewall to the Kinesis Firehose endpoint and verify that logs are being delivered to your S3 bucket.</li>
</ul>
<h3>Step 8: Navigate to Kibana to see your logs parsed and visualized in the [Logs AWS] Firewall Log dashboard</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/blog-elastic-firewall-log-dashboard.png" alt="firewall log dashboard" /></p>
<h2>Wrapping up</h2>
<p>We’re excited to bring you this latest integration for AWS Cloud and Kinesis Data Firehose into production. The ability to consolidate logs and metrics to gain visibility across your cloud and on-premises environment is crucial for today’s distributed environments and applications.</p>
<p>From EC2, CloudWatch, Lambda, ECS and SAR, <a href="https://www.elastic.co/integrations/data-integrations?solution=all-solutions&amp;category=aws">Elastic Integrations</a> allow you to quickly and easily get started with ingesting your telemetry data for monitoring, analytics, and observability. Elastic is constantly delivering frictionless customer experiences, allowing anytime, anywhere access to all of your telemetry data. This streamlined, native integration with AWS is the latest example of our commitment.</p>
<h2>Start a free trial today</h2>
<p>You can begin with a <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k">7-day free trial</a> of Elastic Cloud within the AWS Marketplace to start monitoring and improving your users' experience today!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-kinesis-data-firehose-observability-analytics/image2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AWS VPC Flow log analysis with GenAI in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/aws-vpc-flow-log-analysis-with-genai-elastic</link>
            <guid isPermaLink="false">aws-vpc-flow-log-analysis-with-genai-elastic</guid>
            <pubDate>Fri, 07 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has a set of embedded capabilities such as a GenAI RAG-based AI Assistant and a machine learning platform as part of the product baseline. These make analyzing the vast number of logs you get from AWS VPC Flows easier.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a full observability solution by supporting metrics, traces, and logs for applications and infrastructure. In AWS deployments, VPC flow logs are critical for managing performance, network visibility, security, compliance, and the overall AWS environment. Several examples of what they reveal:</p>
<ol>
<li>
<p>Where traffic is coming from and going to, both across the deployment boundary and within the deployment. This helps identify unusual or unauthorized communications</p>
</li>
<li>
<p>Traffic volumes detecting spikes or drops which could indicate service issues in production or an increase in customer traffic</p>
</li>
<li>
<p>Latency and Performance bottlenecks - with VPC Flow logs, you can look at latency for a flow (in and outflows), and understand patterns</p>
</li>
<li>
<p>Accepted and rejected traffic helps determine where potential security threats and misconfigurations lie. </p>
</li>
</ol>
<p>AWS VPC Flow logs are a great example of the value of logs. Logging is an important part of observability, even though we generally think first of metrics and tracing. The volume of logs that an application and its underlying infrastructure produce can be daunting, and VPC Flow logs are no exception. However, they also provide a significant amount of insight.</p>
<p>Before we proceed, it is important to understand what Elastic provides in managing AWS and VPC Flow logs:</p>
<ol>
<li>
<p>A full set of integrations to manage VPC Flows and the <a href="https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy">entire end-to-end deployment on AWS</a>. </p>
</li>
<li>
<p>Elastic has a simple-to-use <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose integration</a>. </p>
</li>
<li>
<p>Elastic’s tools such as <a href="https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability">Discover, spike analysis,  and anomaly detection help provide you with better insights and analysis</a>.</p>
</li>
<li>
<p>And a set of simple <a href="https://www.elastic.co/guide/en/observability/current/monitor-amazon-vpc-flow-logs.html#aws-firehose-dashboard">Out-of-the-box dashboards</a></p>
</li>
</ol>
<p>In today’s blog, we’ll cover how Elastic’s other features make analysis and root cause analysis (RCA) of VPC flow logs even easier. Specifically, we will focus on managing the number of rejects, as this helps ensure there weren’t any unauthorized or unusual activities:</p>
<ol>
<li>
<p>Set up an easy-to-use SLO (newly released) to detect when things are potentially degrading</p>
</li>
<li>
<p>Create an ML job to analyze different fields of the VPC Flow log</p>
</li>
<li>
<p>Use our newly released RAG-based AI Assistant to help analyze the logs without needing to know Elastic’s query language or even how to build graphs in Elastic</p>
</li>
<li>
<p>Use ES|QL to help understand and analyze latency patterns.</p>
</li>
</ol>
<p>In subsequent blogs, we will use the AI Assistant and ES|QL to show how to get other insights beyond just REJECT/ACCEPT from VPC Flow logs.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>Follow the steps in the following blog to get <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s three-tier app</a> installed as instructed in its Git repository, and bring in the <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS VPC Flow logs</a>.</p>
</li>
<li>
<p>Ensure you have an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html">ML node configured</a> in your Elastic stack</p>
</li>
<li>
<p>To use the AI Assistant you will need a trial or upgrade to Platinum.</p>
</li>
</ul>
<h2>SLO with VPC Flow Logs</h2>
<p>Elastic’s SLO capability is based directly on the Google SRE Handbook. All the definitions and semantics are utilized as described in Google’s SRE handbook. Hence users can perform the following on SLOs in Elastic:</p>
<ul>
<li>Define an SLO on Logs not just metrics - Users can use KQL (log-based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric.</li>
<li>Define SLO, SLI, Error budget and burn rates. Users can also use occurrence versus time slice-based budgeting. </li>
<li>Manage, with dashboards, all the SLOs in a singular location.</li>
<li>Trigger alerts from the defined SLO, whether the SLI is off, the burn rate is used up, or the error rate is X.</li>
</ul>
<p>Setting up an SLO for VPC is easy. You simply create a query you want to trigger off. In our case, we look for all the good events where <em>aws.vpcflow.action=ACCEPT</em> and we define the target at 85%. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowSLOsetup.png" alt="Setting up SLO for VPC FLow log" /></p>
<p>As the following example shows, over the last 7 days, we have exceeded our budget by 43%. Additionally, we have not complied for the last 7 days.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowSLOMiss.png" alt="VPC Flow Reject SLO" /></p>
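<p>The error budget arithmetic behind a figure like this can be sketched in a few lines. This is a minimal illustration in Python, assuming occurrence-based budgeting; the function name is ours, and the event counts are hypothetical values chosen only so that the result corresponds to a 43% overrun.</p>
<pre><code class="language-python">def error_budget_consumed(good_events, total_events, target):
    '''Return the fraction of the error budget used by an occurrence-based SLO.'''
    sli = good_events / total_events  # share of events with action ACCEPT
    allowed_bad = 1.0 - target        # the error budget, as a fraction
    actual_bad = 1.0 - sli            # the observed share of bad events
    return actual_bad / allowed_bad


# Hypothetical counts for illustration only.
consumed = error_budget_consumed(good_events=7855, total_events=10000, target=0.85)
print(round(consumed, 2))
</code></pre>
<p>A result above 1.0 means the budget is exhausted. In this sketch, 21.45% of events are bad against an allowance of 15%, so the budget is consumed about 1.43 times over.</p>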
<h2>Analyzing the SLO with AI Assistant</h2>
<p>Now that we see that there is an issue with the VPC Flows, we immediately work with the AI Assistant to start analyzing the SLO. Because it's a chat interface we simply open the AI Assistant and work through some simple analysis: (See Animated GIF for a demo below)</p>
<h3>AI Assistant analysis:</h3>
<ul>
<li>
<p><strong>What were the top 3 source.address that had <em>aws.vpcflow.action=REJECT</em> over the last 7 days, which is causing this SLO issue?</strong> - We wanted to simply see what could be causing the loss in error budget. Were there any particular source addresses causing a heavy reject rate?</p>
<ul>
<li>
<p>The answer: A table with the highest count = 42670 and <em>source.address = 79.110.62.185</em></p>
</li>
<li>
<p>There is one singular <em>source.address</em> that is causing the loss in SLO. </p>
</li>
</ul>
</li>
<li>
<p><strong>What is the largest number of  <em>aws.vpcflow.action=REJECT</em> in a 30 min time frame for the last 3 days where the <em>source.address=79.110.62.185</em>?</strong> - After understanding that a specific source.address is causing the loss in SLO, we want to understand the averages. </p>
<ul>
<li>
<p><strong>The answer:</strong> &quot;The largest number of <em>aws.vpcflow.action=REJECT</em> in a 30-minute time frame for the last 3 days where the <em>source.address</em> is 79.110.62.185 is 229. This occurred on 2024-06-01T04:00:00.000Z.&quot;</p>
</li>
<li>
<p>This means the REJECT rate is low but fairly consistent, rather than spiky, over the last 7 days.</p>
</li>
</ul>
</li>
<li>
<p><strong>for the logs with <em>source.address</em>=&quot;79.110.62.185&quot; was there any country code of <em>source.geo.country_iso_code</em> field present. If yes what is the value</strong> - Given the last question showed a low REJECT rate, it only means that this was fairly consistent vs spiky over the last 7 days.</p>
<ul>
<li><strong>The answer:</strong> Yes, there is a country code present in the <em>source.geo.country_iso_code</em> field for logs with <em>source.address</em>=&quot;79.110.62.185&quot;. The value is BG (Bulgaria).</li>
</ul>
</li>
<li>
<p><strong>Is there a specific destination.address where <em>source.address=79.110.62.185</em> is getting a <em>aws.vpcflow.action=REJECT</em>. Give me both the destination.address and the number of REJECTs for that destination.address?</strong></p>
<ul>
<li><strong>The answer:</strong> destination.address of 10.0.0.27 is giving a reject number of 53433 in this time frame.</li>
</ul>
</li>
<li>
<p><strong>Graph the number of REJECT vs ACCEPT for <em>source.address</em>=&quot;79.110.62.185&quot; over the last 7 days. The graph is on a daily basis in a singular graph</strong> - We asked this question to see what the comparison is between ACCEPT and REJECT. </p>
<ul>
<li><strong>The answer:</strong> See the animated GIF to see that the generated graph is fairly stable</li>
</ul>
</li>
<li>
<p><strong>Were there any source.address that had a spike, high reject rate in a 30-minute period over the last 30 days?</strong> - We wanted to see if there was any other spike.</p>
<ul>
<li><strong>The answer</strong> - Yes, there was a source.address that had a spike in high reject rates in a 30-minute period over the last 30 days. <em>source.address</em>: 185.244.212.67, Reject Count: 8975, Time Period: 2024-05-22T03:00:00.000Z</li>
</ul>
</li>
</ul>
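<p>For readers who want to see what a question like the first one translates to, here is a rough sketch of an equivalent Elasticsearch search body, built in Python. It is not the query the AI Assistant generated; the function name and aggregation name are ours, and it assumes the fields are mapped as keywords, as they are in ECS.</p>
<pre><code class="language-python">import json


def top_rejecting_sources(size=3, window='now-7d'):
    '''Build a search body: top source addresses by REJECT count.'''
    return {
        'size': 0,  # return only aggregation results, no individual documents
        'query': {
            'bool': {
                'filter': [
                    {'term': {'aws.vpcflow.action': 'REJECT'}},
                    {'range': {'@timestamp': {'gte': window}}},
                ]
            }
        },
        'aggs': {
            'top_sources': {
                'terms': {'field': 'source.address', 'size': size}
            }
        },
    }


print(json.dumps(top_rejecting_sources(), indent=2))
</code></pre>
<p>The printed JSON could be sent to the search API of the VPC flow log data stream. The AI Assistant removes the need to write this by hand, which is the point of the questions above.</p>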
<hr />
<h3>Watch the flow</h3>
&lt;Video vidyardUuid=&quot;1jvEpzfkci9j6AoL42XWA3&quot; /&gt;
<h3>Potential issue:</h3>
<p>The server handling requests from source <strong><em>79.110.62.185</em></strong> is potentially having an issue.</p>
<p>Again using logs, we asked the AI Assistant to give the <em>eni</em> IDs where the internal IP address was 10.0.0.27.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlow-findingwebserver.png" alt="Finding the issue - webserver" /></p>
<p>From our AWS console, we know that this is the webserver. After further analysis in Elastic and with the developers, we realized that a recently installed new version was causing a problem with connections.</p>
<h2>Locating anomalies with ML</h2>
<p>While using the AI Assistant is great for analyzing information, another important aspect of VPC flow management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs to analyze specific metrics or multiple metrics to look for anomalies.</p>
<p>VPC Flow logs come with a large amount of information. The full set of fields is listed in <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-logs-basics">AWS docs</a>. We will use a specific subset to help detect anomalies.</p>
<p>We set up anomaly detection for aws.vpcflow.action=REJECT, which requires us to use multi-metric anomaly detection in Elastic.</p>
<p>The config we used utilizes:</p>
<p>Detectors:</p>
<ul>
<li>
<p>destination.address</p>
</li>
<li>
<p>destination.port</p>
</li>
</ul>
<p>Influencers:</p>
<ul>
<li>
<p>source.address</p>
</li>
<li>
<p>aws.vpcflow.action</p>
</li>
<li>
<p>destination.geo.region_iso_code</p>
</li>
</ul>
<p>The way we set this up will help us understand if there is a large spike in REJECT/ACCEPT against <em>destination.address</em> values from a specific <em>source.address</em> and/or <em>destination.geo.region_iso_code</em> location.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowanomalysetup.png" alt="Anomaly detection job config" /></p>
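<p>As a rough illustration of how the detectors and influencers listed above fit together, the sketch below builds a request body for an anomaly detection job in Python. The exact job in the screenshot may differ: the <em>count</em> function, the use of partition fields, the bucket span, and the helper name are all assumptions made for this example, and the datafeed that restricts the data to REJECT events is not shown.</p>
<pre><code class="language-python">import json


def build_vpcflow_job(bucket_span='15m'):
    '''Build a request body for an Elasticsearch anomaly detection job.'''
    detector_fields = ['destination.address', 'destination.port']
    return {
        'description': 'VPC Flow log anomalies (illustrative sketch)',
        'analysis_config': {
            'bucket_span': bucket_span,
            # Assumption: model the event count, split by each detector field.
            'detectors': [
                {'function': 'count', 'partition_field_name': field}
                for field in detector_fields
            ],
            'influencers': [
                'source.address',
                'aws.vpcflow.action',
                'destination.geo.region_iso_code',
            ],
        },
        'data_description': {'time_field': '@timestamp'},
    }


print(json.dumps(build_vpcflow_job(), indent=2))
</code></pre>
<p>Influencers do not change what is modeled. They tell the job which field values to report alongside an anomaly, which is how a specific <em>source.address</em> surfaces in the results.</p>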
<p>The job once run reveals something interesting:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/VPCFlowAnomalyDetection.png" alt="Anomaly detected" /></p>
<p>Notice that <em>source.address</em> 185.244.212.67 has had a high REJECT rate in the last 30 days. </p>
<p>Notice where we found this before? In the AI Assistant!</p>
<p>While we can run the AI Assistant and find this sort of anomaly, the ML job can be set up to run continuously and alert us on such spikes. This will help us understand if there are any issues with the webserver, like the one we found above, or even potential security attacks.</p>
<h2>Conclusion:</h2>
<p>You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze VPC Flows without needing to know query syntax, where the data lives, or even the field names. You’ve also seen how an SLO can alert you to a potential issue or degradation in service. Check out our other blogs on AWS VPC Flow analysis in Elastic:</p>
<ol>
<li>
<p>A full set of integrations to manage VPC Flows and the <a href="https://www.elastic.co/observability-labs/blog/aws-service-metrics-monitor-observability-easy">entire end-to-end deployment on AWS</a>. </p>
</li>
<li>
<p>Elastic has a simple-to-use <a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose integration</a>. </p>
</li>
<li>
<p>Elastic’s tools such as <a href="https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability">Discover, spike analysis,  and anomaly detection help provide you with better insights and analysis</a>.</p>
</li>
<li>
<p>And a set of simple <a href="https://www.elastic.co/guide/en/observability/current/monitor-amazon-vpc-flow-logs.html#aws-firehose-dashboard">Out-of-the-box dashboards</a></p>
</li>
</ol>
<h2>Try it out</h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on the cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environment. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/aws-vpc-flow-log-analysis-with-genai-elastic/21-cubes.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Best Practices for Log Management: Leveraging Logs for Faster Problem Resolution]]></title>
            <link>https://www.elastic.co/observability-labs/blog/best-practices-logging</link>
            <guid isPermaLink="false">best-practices-logging</guid>
            <pubDate>Wed, 11 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore effective log management strategies to improve system reliability and performance. Learn about data collection, processing, analysis, and cost-effective management of logs in complex software environments.]]></description>
            <content:encoded><![CDATA[<p>In today's rapid software development landscape, efficient log management is crucial for maintaining system reliability and performance. With expanding and complex infrastructure and application components, the responsibilities of operations and development teams are ever-growing and multifaceted. This blog post outlines best practices for effective log management, addressing the challenges of growing data volumes, complex infrastructures, and the need for quick problem resolution.</p>
<h2>Understanding Logs and Their Importance</h2>
<p>Logs are records of events occurring within your infrastructure, typically including a timestamp, a message detailing the event, and metadata identifying the source. They are invaluable for diagnosing issues, providing early warnings, and speeding up problem resolution. Logs are often the primary signal that developers enable, offering significant detail for debugging, performance analysis, security, and compliance management.</p>
<h2>The Logging Journey</h2>
<p>The logging journey involves three basic steps: collection and ingestion, processing and enrichment, and analysis and rationalization. Let's explore each step in detail, covering some of the best practices for each section.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/blog-elastic-collection-and-ingest.png" alt="Logging Journey" /></p>
<h3>1. Log Collection and Ingestion</h3>
<h4>Collect Everything Relevant and Actionable</h4>
<p>The first step is to collect all logs into a central location. This involves identifying all your applications and systems and collecting their logs. Comprehensive data collection ensures no critical information is missed, providing a complete picture of your system's behavior. In the event of an incident, having all logs in one place can significantly reduce the time to resolution. It's generally better to collect more data than you need, as you can always filter out irrelevant information later and delete logs that are no longer needed.</p>
<h4>Leverage Integrations</h4>
<p>Elastic provides over 300 integrations that simplify data onboarding. These integrations not only collect data but also come with dashboards, saved searches, and pipelines to parse the data. Utilizing these integrations can significantly reduce manual effort and ensure data consistency.</p>
<h4>Consider Ingestion Capacity and Costs</h4>
<p>An important aspect of log collection is ensuring you have sufficient ingestion capacity at a manageable cost. When assessing solutions, be cautious about those that charge significantly more for high cardinality data, as this can lead to unexpectedly high costs in observability solutions. We'll talk more about cost effective log management later in this post.</p>
<h4>Use Kafka for Large Projects</h4>
<p>For larger organizations, implementing Kafka can improve log data management. Kafka acts as a buffer, making the system more reliable and easier to manage. It allows different teams to send data to a centralized location, which can then be ingested into Elastic.</p>
<h3>2. Processing and Enrichment</h3>
<h4>Adopt Elastic Common Schema (ECS)</h4>
<p>One key aspect of log collection is to normalize data as much as possible across all of your applications and infrastructure. Having a common semantic schema is crucial. Elastic contributed Elastic Common Schema (ECS) to OpenTelemetry (OTel), helping accelerate the adoption of OTel-based observability and security. This move towards a more normalized way to define and ingest logs (including metrics and traces) is beneficial for the industry.</p>
<p>Using ECS helps standardize field names and data structures, making data analysis and correlation easier. This common schema ensures your data is organized predictably, facilitating more efficient querying and reporting. Learn more about ECS <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">here</a>.</p>
<h4>Optimize Mappings for High Volume Data</h4>
<p>For high cardinality fields or those rarely used, consider optimizing or removing them from the index. This can improve performance by reducing the amount of data that needs to be indexed and searched. Our documentation has sections to tune your setup for <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html">disk usage</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html">search speed</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">indexing speed</a>.</p>
<h4>Managing Structured vs. Unstructured Logs</h4>
<p>Structured logs are generally preferable as they offer more value and are easier to work with. They have a predefined format and fields, simplifying information extraction and analysis. For custom logs without pre-built integrations, you may need to define your own parsing rules.</p>
<p>For unstructured logs, full-text search capabilities can help mitigate limitations. By indexing logs, full-text search allows users to search for specific keywords or phrases efficiently, even within large volumes of unstructured data. This is one of the main differentiators of Elastic's observability solution. You can simply search for any keyword or phrase and get results in real-time, without needing to write complex regular expressions or parsing rules at query time.</p>
<h4>Schema-on-Read vs. Schema-on-Write</h4>
<p>There are two main approaches to processing log data:</p>
<ol>
<li>
<p>Schema-on-read: Some observability dashboarding capabilities can perform runtime transformations to extract fields from non-parsed sources on the fly. This is helpful when dealing with legacy systems or custom applications that may not log data in a standardized format. However, runtime parsing can be time-consuming and resource-intensive, especially for large volumes of data.</p>
</li>
<li>
<p>Schema-on-write: This approach offers better performance and more control over the data. The schema is defined upfront, and the data is structured and validated at the time of writing. This allows for faster processing and analysis of the data, which is beneficial for enrichment.</p>
</li>
</ol>
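<p>The difference between the two approaches can be sketched in plain Python. This is a toy illustration, not how Elasticsearch implements either model; the log layout, the regular expression, and the field names are assumptions made for the example.</p>
<pre><code class="language-python">import re

# Hypothetical log layout: timestamp, level, then a free-text message.
LINE_PATTERN = re.compile(r'^(\S+) (\w+) (.*)$')


def parse_line(raw):
    '''Extract structured fields from one raw log line.'''
    match = LINE_PATTERN.match(raw)
    if match is None:
        return {'message': raw}
    timestamp, level, message = match.groups()
    return {'@timestamp': timestamp, 'log.level': level, 'message': message}


raw_logs = [
    '2024-09-11T10:00:00Z ERROR null pointer in checkout',
    '2024-09-11T10:00:05Z INFO request completed',
]

# Schema-on-write: parse once at ingestion and store the structured documents.
stored_docs = [parse_line(line) for line in raw_logs]
errors_on_write = [doc for doc in stored_docs if doc.get('log.level') == 'ERROR']

# Schema-on-read: store raw lines and parse them again on every query.
errors_on_read = [
    doc for doc in (parse_line(line) for line in raw_logs)
    if doc.get('log.level') == 'ERROR'
]

print(errors_on_write == errors_on_read)
</code></pre>
<p>Both paths return the same result. The difference is where the parsing cost lands: once per document at ingestion, or once per document on every query.</p>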
<h3>3. Analysis and Rationalization</h3>
<h4>Full-Text Search</h4>
<p>Elastic's full-text search capabilities, powered by Elasticsearch, allow you to quickly find relevant logs. The Kibana Query Language (KQL) enhances search efficiency, enabling you to filter and drill down into the data to identify issues rapidly.</p>
<p>Here are a few examples of KQL queries:</p>
<pre><code>// Filter documents where a field exists
http.request.method: *

// Filter documents that match a specific value
http.request.method: GET

// Search all fields for a specific value
Hello

// Filter documents where a text field contains specific terms
http.request.body.content: &quot;null pointer&quot;

// Filter documents within a range
http.response.bytes &lt; 10000

// Combine range queries
http.response.bytes &gt; 10000 and http.response.bytes &lt;= 20000

// Use wildcards to match patterns
http.response.status_code: 4*

// Negate a query
not http.request.method: GET

// Combine multiple queries with AND/OR
http.request.method: GET and http.response.status_code: 400
</code></pre>
<h4>Machine Learning Integration</h4>
<p>Machine learning can automate the detection of anomalies and patterns within your log data. Elastic offers features like log rate analysis that automatically identify deviations from normal behavior. By leveraging machine learning, you can proactively address potential issues before they escalate.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/screenshot-machine-learning-smv-anomaly.png" alt="Machine Learning" /></p>
<p>Organizations should use a diverse set of machine learning algorithms and techniques to uncover unknown unknowns in log files. Unsupervised machine learning algorithms should be employed for anomaly detection on real-time data, with rate-controlled alerting based on severity.</p>
<p>By automatically identifying influencers, users can gain valuable context for automated root cause analysis (RCA). Log pattern analysis brings categorization to unstructured logs, while log rate analysis and change point detection help identify the root causes of spikes in log data.</p>
<p>Take a look at the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-overview.html">documentation</a> to get started with machine learning in Elastic.</p>
<h4>Dashboarding and Alerting</h4>
<p>Building dashboards and setting up alerting helps you monitor your logs in real time. Dashboards provide a visual representation of your logs, making it easier to identify patterns and anomalies. Alerting can notify you when specific events occur, allowing you to take action quickly.</p>
<h2>Cost-Effective Log Management</h2>
<h3>Use Data Tiers</h3>
<p>Implementing index lifecycle management to move data across hot, warm, cold, and frozen tiers can significantly reduce storage costs. This approach ensures that only the most frequently accessed data is stored on expensive, high-performance storage, while older data is moved to more cost-effective storage solutions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/ilm.png" alt="ILM" /></p>
<p>Our documentation explains how to set up <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-lifecycle-management.html">Index Lifecycle Management</a>.</p>
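<p>A minimal sketch of such a policy is shown below. The policy name <code>my-logs-policy</code>, the snapshot repository <code>my-repository</code>, and all ages and sizes are illustrative assumptions, not recommendations. Adjust them to your own retention and query patterns.</p>
<pre><code>PUT _ilm/policy/my-logs-policy
{
  &quot;policy&quot;: {
    &quot;phases&quot;: {
      &quot;hot&quot;: {
        &quot;actions&quot;: {
          &quot;rollover&quot;: { &quot;max_primary_shard_size&quot;: &quot;50gb&quot;, &quot;max_age&quot;: &quot;1d&quot; }
        }
      },
      &quot;warm&quot;: {
        &quot;min_age&quot;: &quot;2d&quot;,
        &quot;actions&quot;: {
          &quot;forcemerge&quot;: { &quot;max_num_segments&quot;: 1 }
        }
      },
      &quot;cold&quot;: {
        &quot;min_age&quot;: &quot;7d&quot;,
        &quot;actions&quot;: {
          &quot;set_priority&quot;: { &quot;priority&quot;: 0 }
        }
      },
      &quot;frozen&quot;: {
        &quot;min_age&quot;: &quot;30d&quot;,
        &quot;actions&quot;: {
          &quot;searchable_snapshot&quot;: { &quot;snapshot_repository&quot;: &quot;my-repository&quot; }
        }
      },
      &quot;delete&quot;: {
        &quot;min_age&quot;: &quot;90d&quot;,
        &quot;actions&quot;: { &quot;delete&quot;: {} }
      }
    }
  }
}
</code></pre>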
<h3>Compression and Index Sorting</h3>
<p>Applying the best compression settings and using index sorting can further reduce the data footprint. Optimizing the way data is stored on disk can lead to substantial savings in storage costs and improve retrieval performance. As of 8.15, Elasticsearch provides an index mode called &quot;logsdb&quot;, a highly optimized way of storing log data that uses 2.5 times less disk space than the default mode. It automatically applies the best combination of settings for compression, index sorting, and other optimizations that weren't accessible to users before. You can read more about it in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/logs-data-stream.html">logs data stream documentation</a>.</p>
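<p>As a sketch, the mode can be turned on through the <code>index.mode</code> setting in an index template. The template name <code>my-logsdb-template</code>, the index pattern, and the priority below are assumptions for illustration. The setting applies to backing indices created after the template is in place.</p>
<pre><code>PUT _index_template/my-logsdb-template
{
  &quot;index_patterns&quot;: [&quot;logs-myapp-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;priority&quot;: 500,
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;logsdb&quot;
    }
  }
}
</code></pre>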
<h3>Snapshot Lifecycle Management (SLM)</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/slm.png" alt="SLM" /></p>
<p>SLM allows you to back up your data and delete it from the main cluster, freeing up resources. If needed, data can be restored quickly for analysis, ensuring that you maintain the ability to investigate historical events without incurring high storage costs.</p>
<p>Learn more about SLM in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-lifecycle-management.html">documentation</a>.</p>
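<p>For orientation, here is a minimal sketch of an SLM policy. The policy name <code>nightly-log-snapshots</code>, the repository <code>my-repository</code> (which must already be registered), the schedule, and the retention values are assumptions chosen for the example.</p>
<pre><code>PUT _slm/policy/nightly-log-snapshots
{
  &quot;schedule&quot;: &quot;0 30 1 * * ?&quot;,
  &quot;name&quot;: &quot;&lt;nightly-logs-{now/d}&gt;&quot;,
  &quot;repository&quot;: &quot;my-repository&quot;,
  &quot;config&quot;: {
    &quot;indices&quot;: [&quot;logs-*&quot;],
    &quot;include_global_state&quot;: false
  },
  &quot;retention&quot;: {
    &quot;expire_after&quot;: &quot;30d&quot;,
    &quot;min_count&quot;: 5,
    &quot;max_count&quot;: 50
  }
}
</code></pre>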
<h3>Dealing with Large Amounts of Log Data</h3>
<p>Managing large volumes of log data can be challenging. Here are some strategies to optimize log management:</p>
<ol>
<li>Develop a log deletion policy. Evaluate what data to collect and when to delete it.</li>
<li>Consider discarding DEBUG logs or even INFO logs earlier, and delete dev and staging environment logs sooner.</li>
<li>Aggregate short windows of identical log lines, which is especially useful for TCP security event logging.</li>
<li>For applications and code you control, consider moving some logs into traces to reduce log volume while maintaining detailed information.</li>
</ol>
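<p>One way to discard DEBUG logs before they are stored is a <code>drop</code> processor in an ingest pipeline, sketched below. The pipeline name <code>my-logs-drop-debug</code> is hypothetical, and the condition assumes the level is available as a nested <code>log.level</code> string on the document. If your logs carry the level in another field or shape, the condition needs to change accordingly.</p>
<pre><code>PUT _ingest/pipeline/my-logs-drop-debug
{
  &quot;description&quot;: &quot;Drop DEBUG-level log events at ingest time&quot;,
  &quot;processors&quot;: [
    {
      &quot;drop&quot;: {
        &quot;if&quot;: &quot;ctx.log?.level != null &amp;&amp; ctx.log.level.toLowerCase() == 'debug'&quot;
      }
    }
  ]
}
</code></pre>
<p>Dropping at ingest still incurs the network and parsing cost, so filtering at the source is cheaper when you control the application or the agent configuration.</p>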
<h3>Centralized vs. Decentralized Log Storage</h3>
<p>Data locality is an important consideration when managing log data. The costs of ingressing and egressing large amounts of log data can be prohibitively high, especially when dealing with cloud providers.</p>
<p>In the absence of regional redundancy requirements, your organization may not need to send all log data to a central location. Consider keeping log data local to the datacenter where it was generated to reduce ingress and egress costs.</p>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cross-cluster-search.html">Cross-cluster search</a> functionality enables users to search across multiple logging clusters simultaneously, reducing the amount of data that needs to be transferred over the network.</p>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-ccr.html">Cross-cluster replication</a> is useful for maintaining business continuity in the event of a disaster, ensuring data availability even during an outage in one datacenter.</p>
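<p>As an illustration of cross-cluster search, the request below queries local log data and the log data of one remote cluster in a single call. The remote cluster alias <code>dc_west</code> is an assumption and must already be configured as a remote cluster. The query itself is only a placeholder.</p>
<pre><code>GET logs-*,dc_west:logs-*/_search
{
  &quot;size&quot;: 10,
  &quot;query&quot;: {
    &quot;match&quot;: { &quot;message&quot;: &quot;timeout&quot; }
  }
}
</code></pre>
<p>Only the matching results travel between datacenters, not the underlying log data.</p>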
<h2>Monitoring and Performance</h2>
<h3>Monitor Your Log Management System</h3>
<p>Using a dedicated monitoring cluster can help you track the performance of your Elastic deployment. <a href="https://www.elastic.co/guide/en/kibana/current/xpack-monitoring.html">Stack monitoring</a> provides metrics on search and indexing activity, helping you identify and resolve performance bottlenecks.</p>
<h3>Adjust Bulk Size and Refresh Interval</h3>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">Optimizing these settings</a> can balance performance and resource usage. Increasing bulk size and refresh interval can improve indexing efficiency, especially for high-throughput environments.</p>
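<p>A small sketch of the refresh interval side: the request below raises the interval on a hypothetical index called <code>my-logs-index</code>. The value <code>30s</code> is an assumption for illustration. Bulk size is configured on the client or agent that sends the data, so it is not shown here.</p>
<pre><code>PUT my-logs-index/_settings
{
  &quot;index&quot;: {
    &quot;refresh_interval&quot;: &quot;30s&quot;
  }
}
</code></pre>
<p>A longer refresh interval means newly indexed logs take longer to become searchable, so choose a value that matches how fresh your searches need to be.</p>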
<h2>Logging Best Practices</h2>
<h3>Adjust Log Levels</h3>
<p>Ensure that log levels are appropriately set for all applications. Customize log formats to facilitate easier ingestion and analysis. Properly configured log levels can reduce noise and make it easier to identify critical issues.</p>
<h3>Use Modern Logging Frameworks</h3>
<p>Implement logging frameworks that support structured logging. Adding metadata to logs enhances their usefulness for analysis. Structured logging formats, such as JSON, allow logs to be easily parsed and queried, improving the efficiency of log analysis.
If you fully control the application and are already using structured logging, consider using <a href="https://github.com/elastic/ecs-logging">Elastic's version of these libraries</a>, which can automatically parse logs into ECS fields.</p>
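<p>To make this concrete, a structured log event in ECS-style JSON might look roughly like the sample below. All values, including the service name and the error type, are made up for illustration.</p>
<pre><code>{
  &quot;@timestamp&quot;: &quot;2024-08-19T12:34:56.789Z&quot;,
  &quot;log.level&quot;: &quot;error&quot;,
  &quot;message&quot;: &quot;Payment request failed&quot;,
  &quot;ecs.version&quot;: &quot;8.11.0&quot;,
  &quot;service.name&quot;: &quot;checkout&quot;,
  &quot;service.environment&quot;: &quot;production&quot;,
  &quot;http.request.method&quot;: &quot;POST&quot;,
  &quot;error.type&quot;: &quot;TimeoutException&quot;
}
</code></pre>
<p>Because every attribute is already a named field, no parsing rules are needed before you can filter on <code>service.name</code> or <code>log.level</code>.</p>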
<h3>Leverage APM and Metrics</h3>
<p>For custom-built applications, Application Performance Monitoring (APM) provides deeper insights into application performance, complementing traditional logging. APM tracks transactions across services, helping you understand dependencies and identify performance bottlenecks.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/apm.png" alt="APM" /></p>
<p>Consider collecting metrics alongside logs. Metrics can provide insights into your system's performance, such as CPU usage, memory usage, and network traffic. If you're already collecting logs from your systems, adding metrics collection is usually a quick process.</p>
<p>Traces can provide even deeper insights into specific transactions or request paths, especially in cloud-native environments. They offer more contextual information and excel at tracking dependencies across services. However, implementing tracing is only possible for applications you own, and not all developers have fully embraced it yet.</p>
<p>A combined logging and tracing strategy is recommended, where traces provide coverage for newer instrumented apps, and logging supports legacy applications and systems you don't own the source code for.</p>
<h2>Conclusion</h2>
<p>Effective log management is essential for maintaining system reliability and performance in today's complex software environments. By following these best practices, you can optimize your log management process, reduce costs, and improve problem resolution times.</p>
<p>Key takeaways include:</p>
<ul>
<li>Ensure comprehensive log collection with a focus on normalization and common schemas.</li>
<li>Use appropriate processing and enrichment techniques, balancing between structured and unstructured logs.</li>
<li>Leverage full-text search and machine learning for efficient log analysis.</li>
<li>Implement cost-effective storage strategies and smart data retention policies.</li>
<li>Enhance your logging strategy with APM, metrics, and traces for a complete observability solution.</li>
</ul>
<p>Continuously evaluate and adjust your strategies to keep pace with the growing volume and complexity of log data, and you'll be well-equipped to ensure the reliability, performance, and security of your applications and infrastructure.</p>
<p>Check out our other blogs:</p>
<ul>
<li><a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">Build better Service Level Objectives (SLOs) from logs and metrics</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-vpc-flow-log-analysis-with-genai-elastic">AWS VPC Flow log analysis with GenAI in Elastic</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/migrating-billion-log-lines-opensearch-elasticsearch">Migrating 1 billion log lines from OpenSearch to Elasticsearch</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/pruning-incoming-log-volumes">Pruning incoming log volumes with Elastic</a></li>
</ul>
<p>Ready to get started? Use Elastic Observability on Elastic Cloud — the hosted Elasticsearch service that includes all of the latest features.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/best-practices-logging/best-practices-log-management.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</link>
            <guid isPermaLink="false">bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</guid>
            <pubDate>Mon, 19 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to bring your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the <a href="https://www.elastic.co/docs/current/integrations/kubernetes/audit-logs">Elastic Kubernetes Audit Log Integration</a>.</p>
<p>In this blog, we focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:</p>
<ul>
<li><a href="https://www.elastic.co/docs/current/integrations/aws_logs">AWS Custom Logs integration</a> (which we will utilize in this blog)</li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose</a> to send logs from CloudWatch to Elastic</li>
<li><a href="https://www.elastic.co/docs/current/integrations/aws">AWS General integration</a> which supports many AWS sources</li>
</ul>
<p>In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.</p>
<p>Kubernetes auditing <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/">documentation</a> describes the need for auditing in order to get answers to the questions below:</p>
<ul>
<li>What happened?</li>
<li>When did it happen?</li>
<li>Who initiated it?</li>
<li>What resource did it occur on?</li>
<li>Where was it observed?</li>
<li>From where was it initiated (Source IP)?</li>
<li>Where was it going (Destination IP)?</li>
</ul>
<p>Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements. </p>
<p>We give special attention to audit logs in Kubernetes because they are not enabled by default. Audit logs can take up a large amount of memory and storage, so there is usually a trade-off between retaining and investigating audit logs and giving up resources otherwise budgeted for workloads hosted on the Kubernetes cluster. Another reason we’re talking about audit logs in Kubernetes is that, unlike usual container logs, once turned on, these logs are written to the cloud provider’s logging service. This is true for most cloud providers because they manage the Kubernetes control plane. It makes sense for cloud providers to use their built-in orchestration workflows involving the control plane for a managed service backed by their implementation of a logging framework.</p>
<p>Kubernetes audit logs can be quite verbose by default, so it is important to choose selectively how much to log while still meeting the organization’s audit requirements. This is done in the <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/#audit-policy">audit policy</a> file, which is passed to the <code>kube-apiserver</code>. Not all flavors of cloud-provider-hosted Kubernetes clusters allow you to configure the <code>kube-apiserver</code> directly. For example, AWS EKS allows this <a href="https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html">logging</a> to be done only by the control plane.</p>
<p><strong>In this blog we will be using Amazon Elastic Kubernetes Service (Amazon EKS) with the Kubernetes audit logs that are automatically shipped to AWS CloudWatch.</strong></p>
<p>A sample audit log for a secret named “empty-secret”, created by an admin user on EKS, is logged on AWS CloudWatch in the following format:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-clougwatch-logs.png" alt="Alt text" /></p>
<p>Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour? </p>
<p>Now that we have established that the Kubernetes audit logs are being logged in CloudWatch, let’s discuss how to get them ingested into Elasticsearch. Elasticsearch has an integration to consume logs written to CloudWatch. Using this integration with its defaults ingests the JSON from CloudWatch as is, i.e., the real audit log JSON is nested inside the wrapper CloudWatch JSON. When bringing logs into Elasticsearch, it is important to use the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema</a> (ECS) to get the best search and analytics performance. This means there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.</p>
<p>Elasticsearch has a Kubernetes integration that uses Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the <a href="https://github.com/elastic/integrations/blob/main/packages/kubernetes/data_stream/audit_logs/fields/fields.yml">ECS mapping designed for parsing the Kubernetes audit logs</a>, already implemented in the Kubernetes integration, to work on the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.</p>
<h3>What we’re going to do is:</h3>
<ul>
<li>
<p>Read the Kubernetes audit logs from the cloud provider’s logging module, in our case AWS CloudWatch, since this is where the logs reside. We will use Elastic Agent and the <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Elasticsearch AWS Custom Logs integration</a> to read logs from CloudWatch. <strong>Note:</strong> there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.</p>
</li>
<li>
<p>Create two simple ingest pipelines (we do this for best practices of isolation and composability) </p>
</li>
<li>
<p>The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline</p>
</li>
<li>
<p>The second custom pipeline will associate the JSON <code>message</code> field with the correct field expected by the Elasticsearch Kubernetes Audit managed pipeline (aka the Integration) and then <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html"><code>reroute</code></a> the message to the correct data stream, <code>logs-kubernetes.audit_logs-default</code>, which in turn applies all the proper mappings and ingest pipelines to the incoming message</p>
</li>
<li>
<p>The overall flow will be</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/overall-ingestion-flow.png" alt="Alt text" /></p>
<h3>1. Create an AWS CloudWatch integration:</h3>
<p>a.  Populate the AWS access key and secret pair values</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-1.png" alt="Alt text" /></p>
<p>b. In the logs section, populate the log ARN and tags, optionally enable Preserve original event, then save the integration and exit the page</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-2.png" alt="Alt text" /></p>
<h3>2. Next, we will configure the custom ingest pipeline</h3>
<p>We are doing this because we want to override what the generic managed pipeline does. We will retrieve the custom component name by searching for the managed pipeline created as an asset when we install the AWS CloudWatch integration. In this case, we will be adding the custom ingest pipeline <code>logs-aws_logs.generic@custom</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-logs-index-management.png" alt="Alt text" /></p>
<p>From the Dev Tools console, run the requests below. Here, we extract the message field from the CloudWatch JSON and put its value in a field called <code>kubernetes.audit</code>. Then, we reroute the message to the default Kubernetes audit dataset that comes with the Kubernetes integration</p>
<pre><code>PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.message != null &amp;&amp; ctx.message.contains('audit.k8s.io')&quot;,
        &quot;name&quot;: &quot;logs-aws-process-k8s-audit&quot;
      }
    }
  ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  &quot;processors&quot;: [
    {
      &quot;json&quot;: {
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;kubernetes.audit&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: &quot;kubernetes.audit_logs&quot;,
        &quot;namespace&quot;: &quot;default&quot;
      }
    }
  ]
}
</code></pre>
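<p>Before sending real traffic through it, the second pipeline can be exercised with the simulate API. The request below is a sketch: the shortened audit event in <code>message</code> is made up for illustration, and the <code>_index</code> value assumes the default data stream name of the AWS Custom Logs integration.</p>
<pre><code>POST _ingest/pipeline/logs-aws-process-k8s-audit/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_index&quot;: &quot;logs-aws_logs.generic-default&quot;,
      &quot;_source&quot;: {
        &quot;message&quot;: &quot;{\&quot;kind\&quot;:\&quot;Event\&quot;,\&quot;apiVersion\&quot;:\&quot;audit.k8s.io/v1\&quot;,\&quot;level\&quot;:\&quot;Metadata\&quot;,\&quot;verb\&quot;:\&quot;create\&quot;,\&quot;objectRef\&quot;:{\&quot;resource\&quot;:\&quot;secrets\&quot;,\&quot;name\&quot;:\&quot;empty-secret\&quot;}}&quot;
      }
    }
  ]
}
</code></pre>
<p>If the pipeline behaves as intended, the simulated document should contain a parsed <code>kubernetes.audit</code> object, no <code>message</code> field, and an <code>_index</code> that points at the Kubernetes audit logs data stream.</p>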
<p>Let’s understand this further:</p>
<ul>
<li>
<p>When we create a Kubernetes integration, we get a managed index template called <code>logs-kubernetes.audit_logs</code> whose default ingest pipeline is <code>logs-kubernetes.audit_logs-1.62.2</code></p>
</li>
<li>
<p>If we look into the pipeline <code>logs-kubernetes.audit_logs-1.62.2</code>, we see that all the processor logic works against the field <code>kubernetes.audit</code>. This is why the json processor in the above code snippet creates a field called <code>kubernetes.audit</code> before dropping the original <em>message</em> field and rerouting. Rerouting is directed to the <code>kubernetes.audit_logs</code> dataset that backs the <code>logs-kubernetes.audit_logs-1.62.2</code> pipeline (the dataset name is derived from the pipeline naming convention, which has the format <code>logs-&lt;datasetname&gt;-version</code>)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/ingest-pipelines.png" alt="Alt text" /></p>
<h3>3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed</h3>
<p>a. We will deploy Elastic Agent and enroll it in Fleet using the integration policy we created in Step 1. There are a number of ways to <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">deploy Elastic Agent</a>; for this exercise, we will use Docker, which is quick and easy.</p>
<pre><code>% docker run --env FLEET_ENROLL=1 --env FLEET_URL=&lt;&lt;fleet_URL&gt;&gt; --env FLEET_ENROLLMENT_TOKEN=&lt;&lt;fleet_enrollment_token&gt;&gt;  --rm docker.elastic.co/beats/elastic-agent:8.19.13
</code></pre>
<p>b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer, which lets you see Kubernetes audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/discover.jpg" alt="Alt text" /></p>
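<p>Once the events are parsed, questions such as how many secret creation attempts were made per hour can be answered with an aggregation. The sketch below assumes the data stream is <code>logs-kubernetes.audit_logs-default</code> and that the verb and resource are indexed as the keyword fields <code>kubernetes.audit.verb</code> and <code>kubernetes.audit.objectRef.resource</code>. Check the field names in your own documents before relying on it.</p>
<pre><code>GET logs-kubernetes.audit_logs-default/_search
{
  &quot;size&quot;: 0,
  &quot;query&quot;: {
    &quot;bool&quot;: {
      &quot;filter&quot;: [
        { &quot;term&quot;: { &quot;kubernetes.audit.verb&quot;: &quot;create&quot; } },
        { &quot;term&quot;: { &quot;kubernetes.audit.objectRef.resource&quot;: &quot;secrets&quot; } },
        { &quot;range&quot;: { &quot;@timestamp&quot;: { &quot;gte&quot;: &quot;now-24h&quot; } } }
      ]
    }
  },
  &quot;aggs&quot;: {
    &quot;secret_creations_per_hour&quot;: {
      &quot;date_histogram&quot;: { &quot;field&quot;: &quot;@timestamp&quot;, &quot;fixed_interval&quot;: &quot;1h&quot; }
    }
  }
}
</code></pre>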
<h3>4. Let's do a quick recap of what we did</h3>
<p>We configured the CloudWatch integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream, so they pick up all the OOTB mappings and parsing that come with the Kubernetes Audit Logs integration.</p>
<p>In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Configure downsampling directly in Elastic Streams, no more JSON editing needed]]></title>
            <link>https://www.elastic.co/observability-labs/blog/configure-downsampling-elastic-streams</link>
            <guid isPermaLink="false">configure-downsampling-elastic-streams</guid>
            <pubDate>Tue, 02 Jun 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Configure downsampling in Elastic Streams alongside retention and tiers, with a live preview and validation. No more editing ILM or lifecycle JSON.]]></description>
            <content:encoded><![CDATA[<p>Starting in Elastic 9.4 (generally available), <a href="https://www.elastic.co/docs/solutions/observability/streams/streams">Streams</a> lets you view and configure downsampling directly in the Retention tab, alongside retention periods, data tiers, and ingestion context. Open a stream, see how it ages, and change it in one place.</p>
<p>Elasticsearch has supported downsampling for a while, through <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/run-downsampling">ILM policies and data stream lifecycle</a>. But configuring it meant leaving the stream you were looking at, finding the right policy or lifecycle definition, editing JSON, and hoping the intervals were valid. That round trip is gone.</p>
<h2>ILM-backed streams</h2>
<p>ILM ties downsampling to phases. Each phase (hot, warm, cold) can carry one downsample action with a <code>fixed_interval</code>. Streams now shows these actions on the data lifecycle timeline. Click a phase to open a flyout, set the interval, and watch the timeline update before you save.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/configure-downsampling-elastic-streams/ilm-phase-flyout.png" alt="ILM phase flyout with downsample interval and live timeline preview in Elastic Streams" /></p>
<p>Here's the same policy as JSON. This is what Streams is reading and writing when you use the flyout:</p>
<pre><code class="language-json">{
  &quot;phases&quot;: {
    &quot;hot&quot;: {
      &quot;min_age&quot;: &quot;0ms&quot;,
      &quot;actions&quot;: {
        &quot;rollover&quot;: { &quot;max_age&quot;: &quot;1d&quot; },
        &quot;downsample&quot;: {
          &quot;fixed_interval&quot;: &quot;5m&quot;,
          &quot;wait_timeout&quot;: &quot;1d&quot;
        }
      }
    },
    &quot;warm&quot;: {
      &quot;min_age&quot;: &quot;2d&quot;,
      &quot;actions&quot;: {}
    },
    &quot;cold&quot;: {
      &quot;min_age&quot;: &quot;4d&quot;,
      &quot;actions&quot;: {
        &quot;downsample&quot;: {
          &quot;fixed_interval&quot;: &quot;10m&quot;,
          &quot;wait_timeout&quot;: &quot;1d&quot;
        }
      }
    },
    &quot;frozen&quot;: {
      &quot;min_age&quot;: &quot;8d&quot;,
      &quot;actions&quot;: {
        &quot;searchable_snapshot&quot;: {
          &quot;snapshot_repository&quot;: &quot;found-snapshots&quot;,
          &quot;force_merge_index&quot;: true
        }
      }
    },
    &quot;delete&quot;: {
      &quot;min_age&quot;: &quot;30d&quot;,
      &quot;actions&quot;: {
        &quot;delete&quot;: {
          &quot;delete_searchable_snapshot&quot;: true
        }
      }
    }
  }
}
</code></pre>
<p>ILM policies are often shared across multiple streams. If you edit one that's in use elsewhere, Streams warns you and offers <strong>Save as new policy</strong> so you can fork instead of changing the original.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/configure-downsampling-elastic-streams/save-as-new-policy.png" alt="Save as new policy prompt when editing a shared ILM policy in Elastic Streams" /></p>
<h2>Data stream lifecycle (DLM) streams</h2>
<p><a href="https://www.elastic.co/docs/manage-data/data-store/data-streams">Data stream</a> lifecycle takes a different approach: instead of one downsample per phase, you define a sequence of steps (up to 10), each with an <code>after</code> delay and a <code>fixed_interval</code>. Streams shows this as a visual ladder you can build up step by step.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/configure-downsampling-elastic-streams/dlm-downsample-steps.png" alt="DLM-backed stream showing multiple downsample steps in sequence on the lifecycle timeline" /></p>
<p>This is the same configuration as the body of an API call. Each step in the UI maps to one object in the <code>downsampling</code> array:</p>
<pre><code class="language-json">{
  &quot;data_retention&quot;: &quot;14d&quot;,
  &quot;downsampling&quot;: [
    { &quot;after&quot;: &quot;1d&quot;, &quot;fixed_interval&quot;: &quot;5m&quot; },
    { &quot;after&quot;: &quot;3d&quot;, &quot;fixed_interval&quot;: &quot;1h&quot; },
    { &quot;after&quot;: &quot;7d&quot;, &quot;fixed_interval&quot;: &quot;4h&quot; }
  ]
}
</code></pre>
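<p>For reference, a lifecycle like this can be applied and read back through the data stream lifecycle API. The data stream name <code>metrics-myapp-default</code> in the sketch below is a hypothetical placeholder; the body is the configuration shown above.</p>
<pre><code>PUT _data_stream/metrics-myapp-default/_lifecycle
{
  &quot;data_retention&quot;: &quot;14d&quot;,
  &quot;downsampling&quot;: [
    { &quot;after&quot;: &quot;1d&quot;, &quot;fixed_interval&quot;: &quot;5m&quot; },
    { &quot;after&quot;: &quot;3d&quot;, &quot;fixed_interval&quot;: &quot;1h&quot; },
    { &quot;after&quot;: &quot;7d&quot;, &quot;fixed_interval&quot;: &quot;4h&quot; }
  ]
}

GET _data_stream/metrics-myapp-default/_lifecycle
</code></pre>
<p>Each <code>fixed_interval</code> is a multiple of the one before it (5m, 1h, 4h), which is what lets already downsampled data be downsampled again.</p>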
<h2>Get started</h2>
<ul>
<li>Open <strong>Streams</strong> in Kibana and select a stream backed by a <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/time-series-data-stream-tsds">time series data stream</a> (TSDS).</li>
<li>Go to the <strong>Retention</strong> tab. The data lifecycle timeline shows your current downsampling configuration (if any).</li>
<li>For ILM streams, click a phase on the timeline to open the flyout and configure the downsample interval.</li>
<li>For DLM streams, add or edit downsample steps directly in the lifecycle view.</li>
</ul>
<h2>Learn more</h2>
<ul>
<li><a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/downsampling-concepts">Downsampling concepts</a></li>
<li><a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/run-downsampling">Configuring downsampling (ILM and data stream lifecycle)</a></li>
<li><a href="https://www.elastic.co/docs/solutions/observability/streams/management/retention">Manage data retention for Streams</a></li>
</ul>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/configure-downsampling-elastic-streams/article.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Customize your data ingestion with Elastic input packages]]></title>
            <link>https://www.elastic.co/observability-labs/blog/customize-data-ingestion-input-packages</link>
            <guid isPermaLink="false">customize-data-ingestion-input-packages</guid>
            <pubDate>Tue, 26 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this post, learn about input packages and how they can provide a flexible solution to advanced users for customizing their ingestion experience in Elastic.]]></description>
            <content:encoded><![CDATA[<p>Elastic<sup>®</sup> has enabled the collection, transformation, and analysis of data flowing between external data sources and the Elastic Observability solution through <a href="https://www.elastic.co/integrations/">integrations</a>. Integration packages achieve this by encapsulating several components, including <a href="https://www.elastic.co/guide/en/fleet/current/create-standalone-agent-policy.html">agent configuration</a>, inputs for data collection, and assets like <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data streams</a>, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index templates</a>, and <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">visualizations</a>. The breadth of these assets supported in the Elastic Stack increases day by day.</p>
<p>This blog dives into how input packages provide an extremely generic and flexible solution to the advanced users for customizing their ingestion experience in Elastic.</p>
<h2>What are input packages?</h2>
<p>An <a href="https://github.com/elastic/elastic-package">Elastic Package</a> is an artifact that contains a collection of assets that extend the Elastic Stack, providing new capabilities to accomplish a specific task like integration with an external data source. The first use of Elastic packages is <a href="https://github.com/elastic/integrations">integration packages</a>, which provide an end-to-end experience — from configuring Elastic Agent, to collecting signals from the data source, to ingesting them correctly and using the data once ingested.</p>
<p>However, advanced users may need to customize data collection, either because no integration exists for a specific data source or because, even when one exists, they want to collect additional signals or collect them in a different way. Input packages are another type of <a href="https://github.com/elastic/elastic-package">Elastic package</a> that provides the capability to configure Elastic Agent to use the provided inputs in a custom way.</p>
<h2>Let’s look at an example</h2>
<p>Say hello to Julia, who works as an engineer at Ascio Innovation firm. She is currently working with Oracle Weblogic server and wants to get a set of metrics for monitoring it. She goes ahead and installs Elastic <a href="https://docs.elastic.co/integrations/oracle_weblogic">Oracle Weblogic Integration</a>, which uses Jolokia in the backend to fetch the metrics.</p>
<p>Now, her team wants to advance in the monitoring and has the following requirements:</p>
<ol>
<li>
<p>We should be able to extract metrics other than the default ones, which are not supported by the default Oracle Weblogic Integration.</p>
</li>
<li>
<p>We want to have our own bespoke pipelines, visualizations, and experience.</p>
</li>
<li>
<p>We should be able to identify the metrics coming in from two different instances of Weblogic Servers by having data mapped to separate <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">indices</a>.</p>
</li>
</ol>
<p>All the above requirements can be met by using the <a href="https://docs.elastic.co/integrations/jolokia">Jolokia input package</a> to get a customized experience. Let's see how.</p>
<p>Julia can configure the Jolokia input package as shown below, fulfilling the <em>first requirement.</em></p>
<p>The configuration takes the hostname, the JMX Mappings for the fields you want to fetch from the JVM application, and the <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name to which the response fields are mapped.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-1-config-parameters.png" alt="Configuration Parameters for Jolokia Input package" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-2-expanded-doc.png" alt="Metrics getting mapped to the index created by the ‘jolokia_first_dataset’" /></p>
<p>Julia can customize her data by writing her own <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a> and providing her customized <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mappings</a>. Also, she can then build her own bespoke dashboards, hence meeting her <em>second requirement.</em></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-3-ingest-pipelines.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>Let’s say now Julia wants to use another instance of Oracle Weblogic and get a different set of metrics.</p>
<p>This can be achieved by adding another instance of the Jolokia input package and specifying a new <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name, as shown in the screenshot below. The resulting metrics will be mapped to a different <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html">index</a>/data set, fulfilling her <em>third requirement.</em> This helps Julia differentiate metrics coming in from two different instances of Oracle Weblogic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-4-jolokia.png" alt="jolokia metrics" /></p>
<p>The resultant metrics of the query will be indexed to the new data set, jolokia_second_dataset in the below example.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/elastic-blog-5-dataset.png" alt="dataset" /></p>
<p>As we can see above, the Jolokia input package provides the flexibility to get new metrics by specifying different JMX Mappings, which are not supported in the default Oracle Weblogic integration (the user gets metrics from a predetermined set of JMX Mappings).</p>
<p>The Jolokia input package can also be used to monitor any Java-based application that pushes its metrics through JMX, so a single input package can collect metrics from multiple Java applications or services.</p>
<h2>Elastic input packages</h2>
<p>Elastic has supported input packages since the 8.8.0 release. Some of the input packages are now available in beta and will mature gradually:</p>
<ol>
<li>
<p><a href="https://docs.elastic.co/integrations/sql">SQL input package</a>: The SQL input package allows you to execute queries against any SQL database and store the results in Elasticsearch<sup>®</sup>.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/prometheus_input">Prometheus input package</a>: This input package can collect metrics from <a href="https://prometheus.io/docs/instrumenting/exporters/">Prometheus Exporters (Collectors)</a>. It can be used by any service exporting its metrics to a Prometheus endpoint.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/jolokia">Jolokia input package</a>: This input package collects metrics from <a href="https://jolokia.org/agent.html">Jolokia agents</a> running on a target JMX server or dedicated proxy server. It can be used for monitoring any Java-based application, which pushes its metrics through JMX.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/statsd_input">Statsd input package</a>: The statsd input package spawns a UDP server and listens for metrics in StatsD compatible format. This input can be used to collect metrics from services that send data over the StatsD protocol.</p>
</li>
<li>
<p><a href="https://docs.elastic.co/integrations/gcp_metrics">GCP Metrics input package</a>: The GCP Metrics input package can collect custom metrics for any GCP service.</p>
</li>
</ol>
<h2>Try it out!</h2>
<p>Now that you know more about input packages, try building your own customized integration for your service through input packages, and get started with an <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> free trial.</p>
<p>We would love to hear from you about your experience with input packages on the Elastic <a href="https://discuss.elastic.co/">Discuss</a> forum or in <a href="https://github.com/elastic/integrations">the Elastic Integrations repository</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/customize-data-ingestion-input-packages/customize-observability-input-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic Observability: Streams Data Quality and Failure Store Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams</link>
            <guid isPermaLink="false">data-quality-and-failure-store-in-streams</guid>
            <pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Streams, a new AI-driven Elastic Observability feature, helps manage data quality with a failure store so you can monitor, troubleshoot, and retain high-quality data.]]></description>
            <content:encoded><![CDATA[<p>When working with observability and logging data, not all documents make it into Elasticsearch in pristine condition. Some may be dropped due to processing failures in ingest pipelines or mapping errors, while others may be partially ingested with ignored fields if a field's value is incompatible with the defined mappings. These issues can impact downstream analysis and dashboards. Streams data quality makes it easier to monitor the health of your ingested data, identify potential issues, and take corrective action right from the UI. With data quality, you can see exactly how well your Stream is performing and quickly understand whether your data has a <strong>Good</strong>, <strong>Degraded</strong>, or <strong>Poor</strong> quality.</p>
<h2>What's in data quality</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/data-quality-tab.png" alt="Data quality tab" /></p>
<h3>At-a-glance summary</h3>
<p>The summary card shows:</p>
<ul>
<li><strong>Degraded documents</strong> - Documents that contain the <code>_ignored</code> field. See the <a href="https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/mapping-ignored-field"><code>_ignored</code> field documentation</a> for more information.</li>
<li><strong>Failed documents</strong> - Documents that were rejected at ingestion due to mapping conflicts or pipeline failures.</li>
</ul>
<p>The overall <strong>quality score</strong> (Good, Degraded, Poor) is automatically calculated based on the percentage of degraded and failed documents.</p>
<h3>Trends over time</h3>
<p>The tab includes a time-series chart so you can track how degraded and failed documents are accumulating over time. Use the <strong>date picker</strong> to zoom into a specific range and understand when problems are spiking.</p>
<h3>Quality issues table</h3>
<p>A detailed table lists the types of issues affecting your stream. For each issue, you can:</p>
<ul>
<li>See which fields are causing problems.</li>
<li>Review counts of affected documents.</li>
<li>Filter by issues that have not been solved yet (Current issues only).</li>
<li>Open a <strong>flyout</strong> to dive deeper into the cause of the issue and learn how to fix it.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/quality-issue-flyout.png" alt="Data quality issue flyout" /></p>
<h2>Monitoring degraded documents</h2>
<p>A degraded document is one that contains the <code>_ignored</code> field, which means one or more of its fields were ignored during indexing. One of the reasons could be that their values didn’t match the expected mappings. While the rest of the document is still indexed, a high number of degraded documents can affect query results, dashboards, and overall observability accuracy.</p>
<p>To help keep these issues under control, the Data quality tab provides visibility into the percentage of degraded documents in your stream.</p>
<h3>Set up a rule to stay ahead of issues</h3>
<p>You can use the <strong>Create rule</strong> button above the Degraded docs chart to define an alert that notifies you when the percentage of degraded documents crosses a certain threshold. This makes it easy to proactively monitor for mapping mismatches and ensure your data continues to meet quality expectations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/create-rule-button.png" alt="Create rule button" /></p>
<p>For more information on how to configure this rule, see <a href="https://www.elastic.co/docs/solutions/observability/incident-management/create-a-degraded-docs-rule#degraded-docs-rule-conditions">Degraded docs rule conditions</a>.</p>
<h2>Handling failed documents with the failure store</h2>
<p><a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store"><strong>Failure store</strong></a> is a special index that captures documents rejected during ingestion. Instead of losing this data, the failure store retains it in a dedicated <code>::failures</code> index, allowing you to inspect the problematic documents, understand what went wrong, and fix the underlying issues.</p>
<p>In the Data quality tab, failed documents are visible only if your stream has a failure store enabled. To view failure store documents, you need at least the <code>read_failure_store</code> privilege. If the failure store is <strong>not enabled</strong>, you’ll see an <strong>“Enable failure store”</strong> link that opens a modal to configure it and set the retention period. To enable the failure store, you need the <code>manage_failure_store</code> privilege on the specific data stream. For more information about failure store security, refer to <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store#use-failure-store-searching">Searching failures</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/enable-fs-link.png" alt="Enable failure store link" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/failure-store-modal.png" alt="Failure store configuration modal" /></p>
<p>Once enabled, you can <strong>edit the failure store configuration</strong> or disable it at any time using the <strong>Edit</strong> button above the failed docs chart.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/edit-fs-button.png" alt="Edit failure store button" /></p>
<p>The failure store can also be configured in the Streams Retention tab - see <a href="https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams.mdx">this article</a> for more information.</p>
<h2>Technical implementation</h2>
<p>Under the hood, the <strong>Data quality</strong> tab builds on the existing <strong>Dataset quality</strong> plugin - the same one that powers the <a href="https://www.elastic.co/docs/solutions/observability/data-set-quality-monitoring"><strong>Dataset quality page</strong></a> in <strong>Stack Management</strong>. However, instead of working in the context of datasets following the Data stream naming scheme, it’s now tailored specifically for <strong>streams</strong>.</p>
<p>To determine the quality of a stream, the UI sends three <strong>ES|QL</strong> queries to the server:</p>
<ol>
<li><strong>All documents (including failures):</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream, myStream::failures | STATS doc_count = COUNT(*)
</code></pre>
<ol start="2">
<li><strong>Failed documents only:</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream::failures | STATS failed_doc_count = COUNT(*)
</code></pre>
<ol start="3">
<li><strong>Degraded documents:</strong></li>
</ol>
<pre><code class="language-sql">FROM myStream METADATA _ignored | WHERE _ignored IS NOT NULL | STATS degraded_doc_count = COUNT(*)
</code></pre>
<p>The results of these queries are then used to calculate the <strong>percentages</strong> of failed and degraded documents. The overall data quality is determined using simple thresholds:</p>
<ul>
<li><strong>Good:</strong> Both percentages are 0%</li>
<li><strong>Degraded:</strong> Any percentage is greater than 0% but less than 3%</li>
<li><strong>Poor:</strong> Any percentage is above 3%</li>
</ul>
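<p>As a rough illustration, the thresholds above could be expressed as the following Python sketch. It is not the plugin's actual code: the function name <code>quality_score</code> is hypothetical, and two details are assumptions because the text does not specify them: both percentages are computed against the total document count from the first query, and a value of exactly 3% is treated as Poor.</p>

```python
def quality_score(total_docs: int, failed_docs: int, degraded_docs: int) -> str:
    """Classify a stream as Good, Degraded, or Poor from document counts.

    total_docs    -> doc_count from the "all documents" query
    failed_docs   -> failed_doc_count from the failures query
    degraded_docs -> degraded_doc_count from the _ignored query
    """
    if total_docs <= 0:
        # Assumption: an empty stream has no quality problems to report.
        return "Good"

    # Assumption: both percentages use the same denominator.
    failed_pct = 100.0 * failed_docs / total_docs
    degraded_pct = 100.0 * degraded_docs / total_docs
    worst = max(failed_pct, degraded_pct)

    if worst == 0:
        return "Good"
    if worst < 3:
        return "Degraded"
    # Assumption: exactly 3% falls into the Poor bucket.
    return "Poor"


if __name__ == "__main__":
    print(quality_score(1000, 0, 0))
    print(quality_score(1000, 10, 0))
    print(quality_score(1000, 0, 50))
```

<p>Taking the larger of the two percentages mirrors the "any percentage" wording in the list: a single bad signal is enough to lower the overall score.</p>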
<p>For managing the <strong>failure store</strong>, Streams uses the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-options">Update data stream options API</a> with the <code>failure_store</code> parameter to configure and update the failure store settings, including enabling the store and setting the retention period.</p>
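<p>To make the shape of such a request concrete, here is a minimal Python sketch that only builds the method, path, and body without sending anything. The helper name <code>failure_store_request</code> is hypothetical, and the <code>lifecycle.data_retention</code> layout of the body is an assumption based on our reading of the data stream options API, so check it against the linked API reference before relying on it.</p>

```python
import json
from typing import Optional


def failure_store_request(stream: str, enabled: bool,
                          retention: Optional[str] = None):
    """Build (method, path, body) for a data stream options update.

    Assumption: the retention period is passed as
    failure_store.lifecycle.data_retention (for example "30d").
    """
    failure_store = {"enabled": enabled}
    if retention is not None:
        failure_store["lifecycle"] = {"data_retention": retention}
    body = {"failure_store": failure_store}
    return "PUT", f"/_data_stream/{stream}/_options", body


if __name__ == "__main__":
    method, path, body = failure_store_request("myStream", True, "30d")
    print(method, path)
    print(json.dumps(body, indent=2))
```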
<h2>Why you’ll love this</h2>
<p>The new <strong>Data quality</strong> tab gives you:</p>
<ul>
<li>Visibility into ingestion problems without digging into logs</li>
<li>A clear breakdown of degraded vs. failed documents</li>
<li>Insights into which fields are ignored and why</li>
<li>Tools to capture and troubleshoot failed docs with the failure store</li>
</ul>
<p>By surfacing data quality issues directly in the Streams UI, we’re making it easier to keep your data flowing reliably and to ensure your analytics are built on a strong foundation.</p>
<h2><strong>Try it out today</strong></h2>
<p>The <strong>data quality</strong> feature is available in <strong>Elastic Observability on Serverless</strong>, and coming soon for self-managed and Elastic Cloud users.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a>, and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>For more information on Streams:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/data-quality-and-failure-store-in-streams/article.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Deploying Elastic Agent with Confluent Cloud's Elasticsearch Connector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector</link>
            <guid isPermaLink="false">deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector</guid>
            <pubDate>Wed, 22 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Confluent Cloud users can now use the updated Elasticsearch Sink Connector with Elastic Agent and Elastic Integrations for a fully-managed and highly scalable data ingest architecture.]]></description>
            <content:encoded><![CDATA[<p>Elastic and Confluent are key technology partners and we're pleased to announce new investments in that partnership. Built by the original creators of Apache Kafka®, Confluent's data streaming platform is a key component of many Enterprise ingest architectures, and it ensures that customers can guarantee delivery of critical Observability and Security data into their Elasticsearch clusters. Together, we've been working on key improvements to how our products fit together. With <a href="https://www.elastic.co/blog/elastic-agent-output-kafka-data-collection-streaming">Elastic Agent's new Kafka output</a> and Confluent's newly improved <a href="https://www.confluent.io/hub/confluentinc/kafka-connect-elasticsearch/">Elasticsearch Sink Connectors</a>, it's never been easier to collect data from the edge, stream it through Kafka, and deliver it into an Elasticsearch cluster.</p>
<p>In this blog, we examine a simple way to integrate Elastic Agent with Confluent Cloud's Kafka offering to reduce the operational burden of ingesting business-critical data.</p>
<h2>Benefits of Elastic Agent and Confluent Cloud</h2>
<p>When combined, Elastic Agent and Confluent Cloud's updated Elasticsearch Sink connector provide a myriad of advantages for organizations of all sizes. This combined solution offers flexibility in handling any type of data ingest workload in an efficient and resilient manner.</p>
<h3>Fully Managed</h3>
<p>When combined, Elastic Cloud Serverless and Confluent Cloud provide users with a fully managed service. This makes it effortless to deploy and ingest nearly unlimited data volumes without having to worry about nodes, clusters, or scaling.</p>
<h3>Full Elastic Integrations Support</h3>
<p>Sending data through Kafka is fully supported with any of the 300+ Elastic Integrations. In this blog post, we outline how to set up the connection between the two platforms. This ensures you can benefit from our investments in built-in alerts, SLOs, AI Assistants, and more.</p>
<h3>Decoupled Architecture</h3>
<p>Kafka acts as a resilient buffer between data sources (such as Elastic Agent and Logstash) and Elasticsearch, decoupling data producers from consumers. This can significantly reduce total cost of ownership by enabling you to size your Elasticsearch cluster based on typical data ingest volume, not maximum ingest volume. It also ensures system resilience during spikes in data volume.</p>
<h3>Ultimate control over your data</h3>
<p>With our new Output per Integration capability, customers can now send different data to different destinations using the same agent. Customers can easily send security logs directly to Confluent Cloud/Kafka, which can provide delivery guarantees, while sending less critical application logs and system metrics directly to Elasticsearch.</p>
<h2>Deploying the reference architecture</h2>
<p>In the following sections, we will walk you through one of the ways Confluent Kafka can be integrated with Elastic Agent and Elasticsearch using Confluent Cloud's Elasticsearch Sink Connector. As with any streaming and data collection technology, there are many ways a pipeline can be configured depending on the particular use case. This blog post will focus on a simple architecture that can be used as a starting point for more complex deployments.</p>
<p>Some of the highlights of this architecture are:</p>
<ul>
<li>Dynamic Kafka topic selection at Elastic Agents</li>
<li>Elasticsearch Sink Connectors for fully managed transfer from Confluent Kafka to Elasticsearch</li>
<li>Processing data leveraging Elastic's 300+ Integrations</li>
</ul>
<h3>Prerequisites</h3>
<p>Before getting started, ensure you have a Kafka cluster deployed in Confluent Cloud, an Elasticsearch cluster or project deployed in Elastic Cloud, and an installed and enrolled Elastic Agent.</p>
<h3>Configure Confluent Cloud Kafka Cluster for Elastic Agent</h3>
<p>Navigate to the Kafka cluster in Confluent Cloud, and select <code>Cluster Settings</code>. Locate and note the <code>Bootstrap Server</code> address. We will need this value later when we create the Kafka Output in Fleet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/confluent-cluster-settings.png" alt="Confluent Cluster Settings" /></p>
<p>Navigate to <code>Topics</code> in the left-hand navigation menu and create two topics:</p>
<ol>
<li>A topic named <code>logs</code></li>
<li>A topic named <code>metrics</code></li>
</ol>
<p>Next, navigate to <code>API Keys</code> in the left-hand navigation menu:</p>
<ol>
<li>Click <code>+ Add API Key</code></li>
<li>Select the <code>Service Account</code> API key type</li>
<li>Provide a meaningful name for this API Key</li>
<li>Grant the key write permission to the <code>metrics</code> and <code>logs</code> topics</li>
<li>Create the key</li>
</ol>
<p>Note the provided Key and Secret. We will need them later when we configure the Kafka Output in Fleet.</p>
<h3>Configure Elasticsearch and Elastic Agent</h3>
<p>In this section, we will configure the Elastic Agent to send data to Confluent Cloud's Kafka cluster and we will configure Elasticsearch so it can receive data from the Confluent Cloud Elasticsearch Sink Connector.</p>
<h4>Configure Elastic Agent to send data to Confluent Cloud</h4>
<p>Elastic Fleet simplifies sending data to Kafka and Confluent Cloud. With Elastic Agent, a Kafka &quot;output&quot; can be easily attached to all data coming from an agent or it can be applied only to data coming from a specific data source.</p>
<p>Find <code>Fleet</code> in the left-hand navigation and click the <code>Settings</code> tab. On the <code>Settings</code> tab, find the <code>Outputs</code> section and click <code>Add Output</code>.</p>
<p>Perform the following steps to configure the new Kafka output:</p>
<ol>
<li>Provide a <code>Name</code> for the output</li>
<li>Set the <code>Type</code> to <code>Kafka</code></li>
<li>Populate the <code>Hosts</code> field with the <code>Bootstrap Server</code> address we noted earlier</li>
<li>Under <code>Authentication</code>, populate the <code>Username</code> with the <code>API Key</code> and the <code>Password</code> with the <code>Secret</code> we noted earlier <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-output-configuration.png" alt="Elastic Fleet Output" /></li>
<li>Under <code>Topics</code>, select <code>Dynamic Topic</code> and set <code>Topic from field</code> to <code>data_stream.type</code> <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-output-configuration-dynamic-topic.png" alt="Kafka Output Dynamic Topic Configuration" /></li>
<li>Click <code>Save and apply settings</code></li>
</ol>
<p>Next, we will navigate to the <code>Agent Policies</code> tab in Fleet and click to edit the Agent Policy that we want to attach the Kafka output to. With the Agent Policy open, click the <code>Settings</code> tab and change <code>Output for integrations</code> and <code>Output for agent monitoring</code> to the Kafka output we just created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/fleet-agent-policy-kafka.png" alt="Agent Policy Output Configuration" /></p>
<p><strong>Selecting an Output per Elastic Integration</strong>: To set the Kafka output to be used for specific data sources, see the <a href="https://www.elastic.co/guide/en/fleet/master/integration-level-outputs.html">integration-level outputs documentation</a>.</p>
<p><strong>A note about Topic Selection</strong>: The <code>data_stream.type</code> field is a reserved field that Elastic Agent automatically sets to <code>logs</code> if the data we're sending is a log and to <code>metrics</code> if the data we're sending is a metric. Enabling Dynamic Topic selection using <code>data_stream.type</code> will cause Elastic Agent to automatically route metrics to a <code>metrics</code> topic and logs to a <code>logs</code> topic. For information on topic selection, see the Kafka Output's <a href="https://www.elastic.co/guide/en/fleet/master/kafka-output-settings.html#_topics_settings">Topics settings</a> documentation.</p>
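<p>The effect of dynamic topic selection can be pictured with a small Python sketch. This is only an illustration of the routing behavior described above, not Elastic Agent's implementation; the function name <code>select_topic</code> and the sample events are hypothetical.</p>

```python
def select_topic(event: dict, field: str = "data_stream.type") -> str:
    """Return the Kafka topic for an event by reading a dotted field path."""
    value = event
    for key in field.split("."):
        value = value[key]  # walk one level deeper for each path segment
    return value


# Hypothetical sample events shaped like Elastic Agent output.
log_event = {"data_stream": {"type": "logs", "dataset": "system.syslog",
                             "namespace": "default"}}
metric_event = {"data_stream": {"type": "metrics", "dataset": "system.network",
                                "namespace": "default"}}

if __name__ == "__main__":
    print(select_topic(log_event))
    print(select_topic(metric_event))
```

<p>Because the topic name comes straight from the field value, the <code>logs</code> and <code>metrics</code> topics created earlier must exist with exactly those names.</p>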
<h4>Configuring a publishing endpoint in Elasticsearch</h4>
<p>Next, we will set up two publishing endpoints (data streams) for the Confluent Cloud Sink Connector to use when publishing documents to Elasticsearch:</p>
<ol>
<li>We will create a data stream <code>logs-kafka.reroute-default</code> for handling <strong>logs</strong></li>
<li>We will create a data stream <code>metrics-kafka.reroute-default</code> for handling <strong>metrics</strong></li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-overview.png" alt="Sink Connector Overview" /></p>
<p>If we were to leave the data in those data streams as-is, the data would be available but we would find the data is unparsed and lacking vital enrichment. So we will also create two index templates and two ingest pipelines to make sure the data is processed by our Elastic Integrations.</p>
<h4>Creating the Elasticsearch Index Templates and Ingest Pipelines</h4>
<p>The following steps use <a href="https://www.elastic.co/guide/en/kibana/current/devtools-kibana.html">Dev Tools in Kibana</a>, but all of these steps can be completed via the REST API or using the relevant user interfaces in Stack Management.</p>
<p>First, we will create the Index Template and Ingest Pipeline for handling <strong>logs</strong>:</p>
<pre><code class="language-json">PUT _index_template/logs-kafka.reroute
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;logs-kafka.reroute&quot;
    }
  },
  &quot;index_patterns&quot;: [
    &quot;logs-kafka.reroute-default&quot;
  ],
  &quot;data_stream&quot;: {}
}
</code></pre>
<pre><code class="language-json">PUT _ingest/pipeline/logs-kafka.reroute
{
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: [
          &quot;{{data_stream.dataset}}&quot;
        ],
        &quot;namespace&quot;: [
          &quot;{{data_stream.namespace}}&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>Next, we will create the Index Template and Ingest Pipeline for handling <strong>metrics</strong>:</p>
<pre><code class="language-json">PUT _index_template/metrics-kafka.reroute
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;metrics-kafka.reroute&quot;
    }
  },
  &quot;index_patterns&quot;: [
    &quot;metrics-kafka.reroute-default&quot;
  ],
  &quot;data_stream&quot;: {}
}
</code></pre>
<pre><code class="language-json">PUT _ingest/pipeline/metrics-kafka.reroute
{
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: [
          &quot;{{data_stream.dataset}}&quot;
        ],
        &quot;namespace&quot;: [
          &quot;{{data_stream.namespace}}&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p><strong>A note about rerouting</strong>: As a practical example of how this works, a document related to a Linux Network Metric would first land in <code>metrics-kafka.reroute-default</code>. This Ingest Pipeline would inspect the document and find <code>data_stream.dataset</code> set to <code>system.network</code> and <code>data_stream.namespace</code> set to <code>default</code>. It would use these values to reroute the document from <code>metrics-kafka.reroute-default</code> to <code>metrics-system.network-default</code>, where it would be processed by the <code>system</code> integration.</p>
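<p>The naming logic in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reroute processor's code: the function name <code>reroute_target</code> is hypothetical, the data stream type is assumed to be taken from the current data stream name, and a missing field is assumed to fall back to the corresponding part of the current name.</p>

```python
def reroute_target(doc: dict, current_stream: str) -> str:
    """Compute the <type>-<dataset>-<namespace> target for a document."""
    # A name such as "metrics-kafka.reroute-default" splits into three parts.
    stream_type, current_dataset, current_namespace = current_stream.split("-", 2)

    data_stream = doc.get("data_stream", {})
    # Assumption: fall back to the current name when a field is absent.
    dataset = data_stream.get("dataset", current_dataset)
    namespace = data_stream.get("namespace", current_namespace)

    return f"{stream_type}-{dataset}-{namespace}"


if __name__ == "__main__":
    network_doc = {"data_stream": {"dataset": "system.network",
                                   "namespace": "default"}}
    print(reroute_target(network_doc, "metrics-kafka.reroute-default"))
```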
<h3>Configure the Confluent Cloud Elasticsearch Sink Connector</h3>
<p>Now it's time to configure the Confluent Cloud Elasticsearch Sink Connector. We will perform the following steps twice and create two separate connectors, one connector for <strong>logs</strong> and one connector for <strong>metrics</strong>. Where the required settings differ, we will highlight the correct values.</p>
<p>Navigate to your Kafka cluster in Confluent Cloud and select Connectors from the left-hand navigation menu. On the Connectors page, select <code>Elasticsearch Service Sink</code> from a catalog of connectors available.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-install.png" alt="Sink Connector Setup" /></p>
<p>Confluent Cloud presents a simplified workflow for the user to configure a connector. Here we will walk through each step of the process:</p>
<h4>Step 1: Topic Selection</h4>
<p>First, we will select the topic that the connector will consume data from based on which connector we are deploying:</p>
<ul>
<li>When deploying the Elasticsearch Sink Connector for <strong>logs</strong>, select the <code>logs</code> topic.</li>
<li>When deploying the Elasticsearch Sink Connector for <strong>metrics</strong>, select the <code>metrics</code> topic.</li>
</ul>
<h4>Step 2: Kafka Credentials</h4>
<p>Choose <code>KAFKA_API_KEY</code> as the cluster authentication mode. Provide the <code>API Key</code> and <code>Secret</code> noted earlier when we gathered the required Confluent Cloud cluster information. <img src="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/sink-connector-credentials.png" alt="Sink Connector Credentials" /></p>
<h4>Step 3: Authentication</h4>
<p>Provide the Elasticsearch Endpoint address of our Elasticsearch cluster as the <code>Connection URI</code>. The <code>Connection user</code> and <code>Connection password</code> are the authentication information for the account in Elasticsearch that will be used by the Elasticsearch Sink Connector to write data to Elasticsearch.</p>
<h4>Step 4: Configuration</h4>
<p>In this step we will keep the <code>Input Kafka record value format</code> set to <code>JSON</code>. Next, expand <code>Advanced Configuration</code>.</p>
<ol>
<li>We will set <code>Data Stream Dataset</code> to <code>kafka.reroute</code></li>
<li>We will set <code>Data Stream Type</code> based on the connector we are deploying:
<ul>
<li>When deploying the Elasticsearch Sink Connector for logs, we will set <code>Data Stream Type</code> to <code>logs</code></li>
<li>When deploying the Elasticsearch Sink Connector for metrics, we will set <code>Data Stream Type</code> to <code>metrics</code></li>
</ul>
</li>
<li>The correct values for other settings will depend on the specific environment.</li>
</ol>
<h4>Step 5: Sizing</h4>
<p>In this step, notice that Confluent Cloud provides a recommended minimum number of tasks for our deployment. Following the recommendation here is a good starting place for most deployments.</p>
<h4>Step 6: Review and Launch</h4>
<p>Review the <code>Connector configuration</code> and <code>Connector pricing</code> sections and if everything looks good, it's time to click <code>continue</code> and launch the connector! The connector may report as provisioning but will soon start consuming data from the Kafka topic and writing it to the Elasticsearch cluster.</p>
<p>You can now navigate to Discover in Kibana and find your logs flowing into Elasticsearch! Also check out the real-time metrics that Confluent Cloud provides for your new Elasticsearch Sink Connector deployments.</p>
<p>If you have only deployed the first <code>logs</code> sink connector, you can now repeat the steps above to deploy the second <code>metrics</code> sink connector.</p>
<h2>Enjoy your fully managed data ingest architecture</h2>
<p>If you followed the steps above, congratulations. You have successfully:</p>
<ol>
<li>Configured Elastic Agent to send logs and metrics to dedicated topics in Kafka</li>
<li>Created publishing endpoints (data streams) in Elasticsearch dedicated to handling data from the Elasticsearch Sink Connector</li>
<li>Configured managed Elasticsearch Sink connectors to consume data from multiple topics and publish that data to Elasticsearch</li>
</ol>
<p>Next you should enable additional integrations, deploy more Elastic Agents, explore your data in Kibana, and enjoy the benefits of a fully managed data ingest architecture with Elastic Serverless and Confluent Cloud!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/deploying-elastic-agent-with-confluent-clouds-elasticsearch-connector/title.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Log Processing UX Design in Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/designing-log-processing-ux-for-streams</link>
            <guid isPermaLink="false">designing-log-processing-ux-for-streams</guid>
            <pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore log processing in Elastic Streams and the design decisions behind the Processing UX that make log data more accessible, consistent, and actionable.]]></description>
            <content:encoded><![CDATA[<p>This post is written from the perspective of the Elastic Observability design team. It’s aimed at developers and SREs who work with logs and ingest pipelines, and it explains how design decisions shaped the Processing experience in Streams.</p>
<h2>The Design Problem in Log Processing</h2>
<p>We rarely talk about how projects actually begin. </p>
<p>How do you design something that doesn't fully exist yet?</p>
<p>How do you align AI capabilities, system constraints, and real user pain points into one coherent experience?</p>
<p><a href="https://www.elastic.co/elasticsearch/streams">Streams</a> gave us that challenge.</p>
<p>Logs are one of the richest signals in observability - but also one of the messiest. Streams is an agentic AI-powered solution that rethinks how teams work with logs to enable fast incident investigation and resolution. </p>
<p><em>Streams uses AI to partition and parse raw logs, extract relevant fields, reduce schema management overhead, and surface significant events like critical errors and anomalies.</em></p>
<p>This led us to make logs investigation-ready from the start, instead of forcing Site Reliability Engineers to fight their data. But to enable such an experience, we had to carefully rethink a core concept and step in the process: Processing.</p>
<h2>Designing Processing UX in Elastic Streams</h2>
<p>Logs are powerful, but only if they are structured correctly. Today, a user who onboards logs via Elastic Agent with a custom integration and wants to extract something as simple as an IP field has to:</p>
<ul>
<li>Write GROK patterns</li>
<li>Create pipelines</li>
<li>Manage mappings</li>
<li>Test transformation</li>
<li>Iterate repeatedly</li>
</ul>
<p>What sounds simple requires 20+ steps — and deep expertise most teams shouldn’t need. Our goal became simple: make this dramatically simpler.</p>
<p>Our early design question was:</p>
<p><em>“Can we reduce this experience to 2 meaningful steps instead of 20 technical ones?”</em></p>
<p>That question shaped how we approached the Streams UX.</p>
<h3>The Foundation</h3>
<p>Before we jumped into designing the UI in <a href="https://www.elastic.co/kibana">Kibana</a>, we defined a core mental model. </p>
<p>A <a href="https://www.elastic.co/elasticsearch/streams">Stream</a> is a collection of documents stored together that share:</p>
<ul>
<li>Retention</li>
<li>Configuration</li>
<li>Mappings</li>
<li>Processing rules</li>
<li>Lifecycle behaviour</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/1.png" alt="stream-architecture" /></p>
<p>The key design principle:</p>
<p><em>“A Stream should contain data that behaves consistently.”</em></p>
<h3>Why Does Data Consistency Matter?</h3>
<p>We started with an example to test our thinking. Take Nginx access and error logs.</p>
<p>Access logs describe request/response events:</p>
<p><code>192.168.1.10 - - [16/Feb/2026:12:32:10 +0000] &quot;GET /api/orders/123 HTTP/1.1&quot; 200 532 &quot;-&quot; &quot;Mozilla/5.0&quot;</code></p>
<p>Error logs describe diagnostic events:</p>
<p><code>2026/02/16 12:32:10 [error] 2719#2719: *342 connect() failed (111: Connection refused) while connecting to upstream…</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/2.png" alt="log-example" /></p>
<p>If both live in the same Stream, that might cause:</p>
<ul>
<li>Processing logic conflicts</li>
<li>Field divergence</li>
<li>Mapping conflicts</li>
<li>Fundamentally harder investigations</li>
</ul>
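<p>To make the conflict concrete, here is a minimal Python sketch. It assumes a simplified access-log pattern written only for this illustration (it is not the parsing logic Streams uses) and applies it to the two sample lines above. The pattern extracts fields from the access log but finds nothing in the error log, which is exactly the kind of divergence that mixing both formats in one Stream would create.</p>
<pre><code class="language-python">import re

# Simplified pattern for the access-log sample above. Positional groups:
# client IP, timestamp, method, path, status, bytes. Not a full Nginx grammar.
ACCESS = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)'
)

access_line = '192.168.1.10 - - [16/Feb/2026:12:32:10 +0000] "GET /api/orders/123 HTTP/1.1" 200 532 "-" "Mozilla/5.0"'
# Error-log sample, cut at the same point where the sample above is elided.
error_line = '2026/02/16 12:32:10 [error] 2719#2719: *342 connect() failed (111: Connection refused) while connecting to upstream'


def parse_access(line):
    '''Return extracted fields, or None when the line is not an access log.'''
    match = ACCESS.match(line)
    if match is None:
        return None
    ip, timestamp, method, path, status, size = match.groups()
    return {
        'client_ip': ip,
        'timestamp': timestamp,
        'method': method,
        'path': path,
        'status': int(status),
        'bytes': int(size),
    }


print(parse_access(access_line))
print(parse_access(error_line))
</code></pre>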
<p>That insight clarified something critical: </p>
<p><strong><em>“Processing isn’t just about extracting fields. It’s about protecting consistency.”</em></strong></p>
<h3>Making Complexity Manageable</h3>
<p>The ingest ecosystem isn’t small, simple, or hypothetical. Real pipelines use dozens of processors — from common ones like <code>rename</code>, <code>set</code>, <code>convert</code>, and <code>append</code>, to niche types like <code>urldecode</code> and <code>network_direction</code>.</p>
<p>The UI had to support both high-frequency actions and long-tail edge cases without losing structure. Currently Elasticsearch supports over <a href="https://www.elastic.co/docs/reference/enrich-processor">40 different ingest processors</a>. We had to make sure our interface could handle the different types.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/3.png" alt="card-sample" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/4.png" alt="processor-panel" /></p>
<p>We introduced a clear, nested structure for pipeline steps. Users could create, reorder, edit, or remove individual steps or grouped ones with confidence. The <a href="https://eui.elastic.co/docs/patterns/nested-drag-and-drop/">nested drag and drop</a> capability was also added as a pattern in our EUI library.</p>
<p>This gave us the context and foundation to work on integrating those concepts into a model that would be definitive for everything in Streams.</p>
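<p>One way to picture a nested structure of pipeline steps is as an ordered list in which a step can itself be a group with a condition and child steps. The Python sketch below is a toy model under that assumption. The step names echo the processors mentioned above, but the data layout and the <code>apply_steps</code> function are hypothetical and do not represent how Streams or Elasticsearch store or run pipelines.</p>
<pre><code class="language-python">def apply_steps(doc, steps):
    '''Apply steps in order; a group runs its children only if its condition holds.'''
    for step in steps:
        kind = step['type']
        if kind == 'group':
            if step['when'](doc):
                apply_steps(doc, step['steps'])
        elif kind == 'rename':
            if step['field'] in doc:
                doc[step['target']] = doc.pop(step['field'])
        elif kind == 'set':
            doc[step['field']] = step['value']
        elif kind == 'convert':
            doc[step['field']] = int(doc[step['field']])
        elif kind == 'append':
            doc.setdefault(step['field'], []).append(step['value'])
    return doc


pipeline = [
    {'type': 'rename', 'field': 'status', 'target': 'http.response.status_code'},
    {'type': 'convert', 'field': 'http.response.status_code'},
    {
        'type': 'group',
        'when': lambda d: d['http.response.status_code'] == 500,
        'steps': [
            {'type': 'set', 'field': 'event.outcome', 'value': 'failure'},
            {'type': 'append', 'field': 'tags', 'value': 'server_error'},
        ],
    },
]

print(apply_steps({'status': '500'}, pipeline))
</code></pre>
<p>Reordering the list changes the result, which is why being able to move individual steps or whole groups safely matters in the UI.</p>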
<h3>Page Archetypes</h3>
<p>Processing is powerful - and risky. Changing a parsing condition or step might affect:</p>
<ul>
<li>Field availability</li>
<li>Search behaviour</li>
<li>Alerts</li>
<li>AI Insights</li>
<li>Investigations</li>
</ul>
<p>So we asked ourselves: how do we make something so powerful and important safe for the user? The answer led to a core page archetype:</p>
<p><strong>Create &gt; Preview &gt; Confirm</strong></p>
<p>This wasn’t a UI pattern added later. It emerged directly from our concept work and understanding what users would have to deal with.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/5.png" alt="create-preview-confirm" /></p>
<p>To support this archetype and core idea, we also introduced a split-screen structure.</p>
<p><strong>Left: Build</strong></p>
<p>This is where users would:</p>
<ul>
<li>Add processing steps</li>
<li>Define conditions</li>
<li>Apply rules</li>
<li>Leverage AI suggestions, either to create a whole pipeline or to add individual steps such as a GROK processor</li>
</ul>
<p>It remained focused, intentional and structured.</p>
<p><strong>Right: Preview</strong></p>
<p>This is where users would:</p>
<ul>
<li>See real-life log samples</li>
<li>See extracted fields in context</li>
<li>Get immediate feedback on changes, with insights about the matched and unmatched percentage of documents</li>
<li>Open an optional drilldown side panel on the right</li>
</ul>
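<p>The matched and unmatched percentages shown in the preview can be thought of as a simple ratio over the sampled documents. The following Python sketch assumes a hypothetical set of sample lines and a hypothetical IP-prefix pattern. It illustrates the idea only and is not the calculation Streams performs.</p>
<pre><code class="language-python">import re


def match_rate(pattern, samples):
    '''Return the percentage of sample lines that the pattern matches.'''
    if not samples:
        return 0.0
    matched = sum(1 for line in samples if re.search(pattern, line))
    return 100.0 * matched / len(samples)


# Hypothetical sample: three access-style lines and one diagnostic line.
samples = [
    '192.168.1.10 GET /api/orders/123 200',
    '192.168.1.11 POST /api/orders 201',
    'connect() failed while connecting to upstream',
    '10.0.0.7 GET /health 200',
]
ip_pattern = r'^\d{1,3}(\.\d{1,3}){3} '

rate = match_rate(ip_pattern, samples)
print(f'matched {rate:.0f}%, unmatched {100 - rate:.0f}%')
</code></pre>
<p>With these four lines the sketch reports 75% matched and 25% unmatched, the kind of signal that tells a user whether a new step covers the data before they confirm it.</p>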
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/6.png" alt="split-screen-application" /></p>
<p>The preview panel became the anchor of confidence. The split was not about visual symmetry; it was meant to encourage experimentation, give users control over errors, and reduce mistakes. Knowing that users might want to shift their focus from building to a detailed preview, we made both panels resizable, which unlocked more flexibility and control across use cases.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/7.png" alt="stream-architecture" /></p>
<h3>AI Automation</h3>
<p>Streams is agentic and AI-powered. That added another layer of complexity for the design, but also another opportunity to unlock even more power and insights from users' log data.</p>
<p>AI introduced a new tension: how do you accelerate processing without turning it into a black box?</p>
<p>We established a few guardrails:</p>
<ul>
<li>Clear, concise suggestions</li>
<li>Visible impact through matched document metrics</li>
<li>Inspectability</li>
<li>Alignment with the Create → Preview → Confirm model</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/8.png" alt="ai-in-split-screen-model" /></p>
<p>Processing UX became the bridge between automation and human in the loop. Log data is one of the most powerful investigation signals. Every design decision reinforced that belief.</p>
<h2>What We Learned</h2>
<p>Designing for the future does not start with screens. It starts with:</p>
<ul>
<li>Edge case testing</li>
<li>Clear mental models</li>
<li>Strong and guiding principles</li>
<li>Behavioral consistency</li>
<li>Scalable and stress-tested archetypes</li>
</ul>
<p>We knew that for users to unlock insights from their logs, they would need to process and manage their data effectively. We also knew we were shaping their entire observability foundation.</p>
<p>Processing is about trust, control, and scalable data management.</p>
<p>Trust enables investigation speed.</p>
<p>Investigation speed enables resilience.</p>
<h2>Learn more</h2>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a> and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.
Want to know more about Streams? Check out the links below:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams"><em>Retention management</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Check the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/designing-log-processing-ux-for-streams/11.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The DNA of DATA Increasing Efficiency with the Elastic Common Schema]]></title>
            <link>https://www.elastic.co/observability-labs/blog/dna-of-data</link>
            <guid isPermaLink="false">dna-of-data</guid>
            <pubDate>Wed, 25 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic ECS helps improve semantic conversion of log fields. Learn how to quantify the benefits of normalized data, not just for infrastructure efficiency but also for data fidelity.]]></description>
            <content:encoded><![CDATA[<p>The Elastic Common Schema is a fantastic way to simplify and unify a search experience. By aligning disparate data sources into a common language, users have a lower bar to overcome when interpreting events of interest, resolving incidents, or hunting for unknown threats. Beyond the search experience, there are also underlying infrastructure reasons that justify adopting the Elastic Common Schema.</p>
<p>In this blog you will learn about the quantifiable operational benefits of ECS, how to leverage ECS with any data ingest tool, and the pitfalls to avoid. The data source leveraged in this blog is a 3.3GB Nginx log file obtained from Kaggle. The representation of this dataset is divided into three categories: raw, self, and ECS; with raw having zero normalization, self being a demonstration of commonly implemented mistakes observed from my 5+ years of experience working with various users, and finally ECS with the optimal approach of data hygiene.</p>
<p>This hygiene is achieved through the parsing, enrichment, and mapping of data ingested; akin to the sequencing of DNA in order to express genetic traits. Through the understanding of the data's structure, and assigning the correct mapping, a more thorough expression may be represented, stored and searched upon.</p>
<p>If you would like to learn more about ECS, the dataset used in this blog, or available Elastic integrations, please be sure to check out these related links:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/introducing-the-elastic-common-schema">Introducing the Elastic Common Schema</a></p>
</li>
<li>
<p><a href="https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs">Kaggle Web Server Logs</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/integrations/data-integrations">Elastic Integrations</a></p>
</li>
</ul>
<h2>Dataset Validation</h2>
<p>Before we begin, let us review how many documents exist and what we're required to ingest. We have 10,365,152 documents/events from our Nginx log file:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/access-logs.png" alt="nginx access logs" /></p>
<p>With 10,365,152 documents in our targeted end-state:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/end-state.png" alt="end state" /></p>
<h2>Dataset Ingestion: Raw &amp; Self</h2>
<p>To achieve the raw and self ingestion techniques, this example leverages Logstash for simplicity. For the raw data ingest, we use a simple file input with no additional modifications or index templates.</p>
<pre><code>
    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/raw/access.log&quot;
        ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    # No filter: each line is indexed as an unparsed message
    filter {
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-raw&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
<p>For the self ingest, a custom Logstash pipeline with a simple Grok filter was created with no index template applied:</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/self/access.log&quot;
        ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
      grok {
        match =&gt; { &quot;message&quot; =&gt; &quot;%{IP:clientip} - (?:%{NOTSPACE:requestClient}|-) \[%{HTTPDATE:timestamp}\] \&quot;(?:%{WORD:requestMethod} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\&quot; (?:-|%{NUMBER:response}) (?:-|%{NUMBER:bytes_in}) (-|%{QS:bytes_out}) %{QS:user_agent}&quot; }
      }
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-self&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }
</code></pre>
<h2>Dataset Ingestion: ECS</h2>
<p>Elastic includes many integrations that contain everything you need to ensure that your data is ingested as efficiently as possible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/integrations.png" alt="integrations" /></p>
<p>For our use case of Nginx, we'll be using the associated integration's assets only.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-integration.png" alt="nginx integration" /></p>
<p>The installed assets are more than just dashboards: there are ingest pipelines which not only normalize but also enrich the data, while component templates map the fields to their correct types. All we have to do is make sure that incoming data traverses the ingest pipeline and uses these supplied mappings.</p>
<p>Create your index template, and select the component templates supplied by your integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs.png" alt="nginx-ecs" /></p>
<p>Think of the component templates like building blocks to an index template. These allow for the reuse of core settings, ensuring standardization is adopted across your data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-template.png" alt="nginx-ecs-template" /></p>
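<p>For readers who prefer to see the shape of such an index template outside the Kibana UI, the Python sketch below assembles a request body for the Elasticsearch <code>_index_template</code> API. The component template names, the pipeline name, and the priority are placeholders invented for this illustration; substitute the names your installed Nginx integration actually lists. The integration's own component templates may already set the default pipeline, in which case that setting is redundant.</p>
<pre><code class="language-python">import json

# Placeholder names (assumptions): replace them with the component templates
# and ingest pipeline that your installed integration provides.
COMPONENT_TEMPLATES = ['my-nginx-mappings', 'my-nginx-settings']
INGEST_PIPELINE = 'my-nginx-access-pipeline'

index_template = {
    # Matches the index name used by the Logstash output in this post.
    'index_patterns': ['nginx-ecs*'],
    # Component templates act as the reusable building blocks.
    'composed_of': COMPONENT_TEMPLATES,
    'template': {
        # Send every incoming document through the ingest pipeline.
        'settings': {'index.default_pipeline': INGEST_PIPELINE},
    },
    # Arbitrary value chosen for this sketch.
    'priority': 200,
}

# Request body for: PUT _index_template/nginx-ecs
print(json.dumps(index_template, indent=2))
</code></pre>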
<p>For our ingestion method, we merely point to the index name that we specified during the index template creation, in this case, <code>nginx-ecs</code> and Elastic will handle all the rest!</p>
<pre><code>    input {
      file {
        id =&gt; &quot;NGINX_FILE_INPUT&quot;
        path =&gt; &quot;/etc/logstash/ecs/access.log&quot;
        #ecs_compatibility =&gt; disabled
        start_position =&gt; &quot;beginning&quot;
        mode =&gt; read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts =&gt; [&quot;https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243&quot;]
        index =&gt; &quot;nginx-ecs&quot;
        ilm_enabled =&gt; true
        manage_template =&gt; false
        user =&gt; &quot;username&quot;
        password =&gt; &quot;password&quot;
        ssl_verification_mode =&gt; none
        ecs_compatibility =&gt; disabled
        id =&gt; &quot;NGINX-FILE_ES_Output&quot;
      }
    }

</code></pre>
<h2>Data Fidelity Comparison</h2>
<p>Let's compare how many fields are available to search upon the three indices as well as the quality of the data. Our raw index has but 15 fields to search upon, with most being duplicates for aggregation purposes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-raw.png" alt="nginx-raw" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-1.png" alt="mapping-1" /></p>
<p>However, from a Discover perspective, we are limited to <code>6</code> fields!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-raw-discover.png" alt="nginx-raw-discover" /></p>
<p>Our self-parsed index has 37 available fields; however, these too are duplicated and not ideal for efficient searching.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-self.png" alt="nginx-self" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-2.png" alt="mapping-2" /></p>
<p>From a Discover perspective, here we have almost 3x as many fields to choose from, yet without the correct mapping, the ease with which this data may be searched is less than ideal. A great example of this is attempting to calculate the average <code>bytes_in</code> on a text field.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-self-discover.png" alt="nginx-self-discover" /></p>
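<p>The same problem is easy to reproduce outside Elasticsearch. The short Python sketch below uses made-up byte counts to show that values stored as text cannot be averaged until they are converted to numbers, which is what a correct field mapping takes care of at index time.</p>
<pre><code class="language-python">def average(values):
    '''Plain arithmetic mean; works only on numeric values.'''
    return sum(values) / len(values)


# Made-up values, as they would look if bytes_in were kept as text.
bytes_in_as_text = ['532', '1024', '244']

try:
    average(bytes_in_as_text)
except TypeError as err:
    print('cannot average text values:', err)

# Once the values are numeric, the calculation is straightforward.
bytes_in_as_numbers = [int(value) for value in bytes_in_as_text]
print(average(bytes_in_as_numbers))
</code></pre>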
<p>Finally, with our ECS index, we have 71 fields available to us! Notice that courtesy of the ingest pipeline, we have enriched fields of geographic information as well as event categorization fields.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-pipeline.png" alt="nginx-ecs-pipeline" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/mapping-3.png" alt="mapping-3" /></p>
<p>Now what about Discover? There are 51 fields directly available to us for searching purposes:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/nginx-ecs-discover.png" alt="nginx-ecs-discover" /></p>
<p>Using Discover as our basis, our self-parsed index offers roughly 2.8 times as many searchable fields as the raw index (283%), whereas our ECS index offers 8.5 times as many (850%)!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/table-1.png" alt="table-1" /></p>
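<p>The percentages quoted above are ratios relative to the raw index's Discover field count. A quick Python check reproduces them. The raw and ECS counts come from the text, while the self-parsed count of 17 is an assumption inferred from the quoted 283%, since that number appears only in the screenshot.</p>
<pre><code class="language-python">def percent_of_raw(field_count, raw_count):
    '''Express a field count as a percentage of the raw index's count.'''
    return 100.0 * field_count / raw_count


raw_fields = 6     # Discover fields for the raw index (from the text)
self_fields = 17   # Assumption: inferred from the quoted 283%
ecs_fields = 51    # Discover fields for the ECS index (from the text)

print(f'self: {percent_of_raw(self_fields, raw_fields):.0f}% of raw')
print(f'ecs:  {percent_of_raw(ecs_fields, raw_fields):.0f}% of raw')
</code></pre>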
<h2>Storage Utilization Comparison</h2>
<p>Surely with all these fields in our ECS index the size would be exponentially larger than the self normalized index, let alone the raw index? The results may surprise you.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/total-storage.png" alt="total-storage" /></p>
<p>Accounting for the replica of our 3.3GB data set, we can see that normalizing and mapping the data has a significant impact on the amount of storage required.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/dna-of-data/table-2.png" alt="table-2" /></p>
<h2>Conclusion</h2>
<p>While there is an increase in the amount of storage required for any dataset that is enriched, Elastic provides easy solutions to maximize the fidelity of the data to be searched while simultaneously ensuring operational storage efficiency; that is the power of the Elastic Common Schema.</p>
<p>Let's review how we were able to maximize search while minimizing storage:</p>
<ul>
<li>Installing integration assets for the dataset that we are going to ingest.</li>
<li>Customizing the index template to leverage the included components to ensure mapping and parsing are aligned to the Elastic Common Schema.</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I've outlined above to get the most value and visibility out of your data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/dna-of-data/dna-of-data.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Agent Skills for Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-agent-skills-observability-workflows</link>
            <guid isPermaLink="false">elastic-agent-skills-observability-workflows</guid>
            <pubDate>Mon, 16 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Agent Skills for Elastic Observability help SREs and developers run observability workflows through natural language to instrument apps with OpenTelemetry, search logs, manage SLOs, understand service health, and help with LLM observability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a wide set of capabilities, from configuring OpenTelemetry instrumentation and writing ES|QL queries to search logs and metrics, to defining SLOs with the correct indicator types and equation syntax, triaging noisy alert storms, and stitching together service health from multiple signals. SREs are now looking to automate further with AI agents.</p>
<p>Elastic's Agent Skills are open source packages that give your AI coding agent native Elastic expertise. If you're already using Elastic Agent Builder, you get AI agents that work natively with your Observability data. The <a href="https://github.com/elastic/agent-skills">Elastic Agent Skills</a> deliver that platform expertise directly to your AI coding agent, so you can stop debugging AI-generated errors and start shipping production-ready code with the full depth of Elastic.</p>
<p>Skills can be used for specialized tasks across the Elastic stack — Elasticsearch, Kibana, Elastic Security, Elastic Observability, and more. Each skill lives in its own folder with a SKILL.md file containing metadata and instructions the agent follows.</p>
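<p>As a rough illustration of the "metadata plus instructions" layout, the Python sketch below splits a SKILL.md-style string into its two parts. It assumes the metadata is written as flat <code>key: value</code> front matter between two <code>---</code> lines. The example content and the <code>read_skill</code> helper are hypothetical and are not taken from the Elastic repository.</p>
<pre><code class="language-python">def read_skill(text):
    '''Split a SKILL.md string into (metadata dict, instructions string).

    Handles only flat key: value front matter between two --- lines.
    '''
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != '---':
        return {}, text.strip()
    end = lines.index('---', 1)
    metadata = {}
    for line in lines[1:end]:
        key, _, value = line.partition(':')
        metadata[key.strip()] = value.strip()
    return metadata, '\n'.join(lines[end + 1:]).strip()


# Hypothetical file content, used only for this illustration.
example = '''---
name: example-skill
description: Hypothetical skill used only for this illustration.
---
# Instructions
1. Ask which service to inspect.
'''

meta, body = read_skill(example)
print(meta['name'])
print(body)
</code></pre>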
<p>Observability is releasing five skills that together cover the core workflows SREs and developers perform daily. Running Elastic Observability today involves a wide surface area: configuring OpenTelemetry instrumentation, writing ES|QL queries to search logs and metrics, defining SLOs with the correct indicator types and equation syntax, and stitching together service health from multiple signals. Each of these tasks requires domain expertise and familiarity with specific APIs, index patterns, and Kibana workflows. For teams managing dozens of services across multiple environments, this is repetitive, error-prone, and time-consuming.</p>
<p>This article walks through the current Observability skill set, shows an end-to-end workflow, and highlights where these skills are useful in day-to-day operations.</p>
<h2>Why this matters for observability teams</h2>
<p>Modern observability work is usually ad hoc and cross-cutting. In one hour, you may instrument a new service, inspect logs for an incident, check error-budget status, and validate service health across several signals.</p>
<p>Each step often needs different APIs, index patterns, and Kibana workflows. Agent Skills package this task knowledge into reusable units so an agent can execute these steps consistently.</p>
<h2>The observability skills</h2>
<p>The observability set currently focuses on five connected workflows:</p>
<ol>
<li><strong>Instrument applications</strong> Adds the Elastic Distributions of OpenTelemetry to Python, Java, or .NET services (tracing, metrics, logs) or helps migrate from the classic Elastic APM agents to EDOT, with correct OTLP endpoints and configuration.</li>
<li><strong>Search logs</strong> Provides visibility into Elastic Streams — the data routing and processing layer for observability data.</li>
<li><strong>Manage SLOs</strong> Creates and manages Service-Level Objectives in Elastic Observability via the Kibana API — from data exploration through SLO definition, creation, and lifecycle management.</li>
<li><strong>Assess service health</strong> Provides a unified view of service health by combining signals from APM, infrastructure metrics, logs, SLOs, and alerts into a single assessment.</li>
<li><strong>Observe LLM applications</strong> Monitors and troubleshoots LLM-powered applications — tracking token usage, latency, error rates, and model performance across inference calls.</li>
</ol>
<h2>What Agent Skills are</h2>
<p>Agent Skills are self-contained folders with instructions, scripts, and resources that an AI agent loads dynamically for a specific task. Elastic publishes official skills in <a href="https://github.com/elastic/agent-skills">elastic/agent-skills</a>, based on the <a href="https://agentskills.io/">Agent Skills standard</a>.</p>
<p>At a practical level, this means:</p>
<ul>
<li>You describe the goal.</li>
<li>The agent selects the relevant skill or you specify it.</li>
<li>The skill applies the consistent steps and API patterns that Elastic recommends for that job.</li>
</ul>
<h2>Practical example: from incident question to root-cause</h2>
<p>As an SRE, you're notified that a specific customer is experiencing errors. Support has been trying to troubleshoot, but they need help. Support provides a transaction ID to investigate.</p>
<p>You've loaded Elastic's Agent Skills to Claude. You ask Claude:</p>
<p><code>Find out why transaction with id 01ba6cf8e60253bdeb26026caa3278a1 is having issues over the last 24 hours.</code></p>
<p>Claude, with Elastic O11y Skills added, analyzes the issue for that specific transaction with Elastic.</p>
<ol>
<li>It uses the log-search skill to narrow down likely causes.</li>
<li>The root cause is identified.</li>
<li>A potential remediation is recommended.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-agent-skills-observability-workflows/Analyze-logs-for-transaction.png" alt="Claude Code interaction for log-search skill" /></p>
<h2>How to get started</h2>
<p>Install Elastic skills with the <code>skills</code> CLI:</p>
<pre><code class="language-bash">npx skills add elastic/agent-skills
</code></pre>
<p>Install a specific skill directly:</p>
<pre><code class="language-bash">npx skills add elastic/agent-skills --skill logs-search 
</code></pre>
<p>Then run your agent and give it an outcome-focused request, for example:</p>
<pre><code class="language-text">My cart service is experiencing some slowness, are there any errors over the last 3 hours? Please give me a summary of these logs.
</code></pre>
<p>The key shift is that the request is outcome-first. The skill captures implementation details such as API order, field expectations, and verification steps.</p>
<h2>What is next</h2>
<p>The planned scope includes broader workflow coverage. As skills mature, teams can combine them into repeatable operating patterns that still support ad hoc investigation.</p>
<p>If you want to try this model now, get <a href="https://github.com/elastic/agent-skills">Elastic's Agent Skills</a> and start with one service and one workflow:</p>
<ol>
<li>Assess service health.</li>
<li>Run guided log investigation for one real incident.</li>
<li>Add SLO management after baseline telemetry quality is in place.</li>
<li>Understand how well your LLM is performing for your developers.</li>
</ol>
<p>This gives you a concrete way to evaluate agent-assisted observability work without changing your full operating model in one step.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-agent-skills-observability-workflows/header2.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Accelerate log analytics in Elastic Observability with Automatic Import powered by Search AI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-automatic-import-logs-genai</link>
            <guid isPermaLink="false">elastic-automatic-import-logs-genai</guid>
            <pubDate>Wed, 04 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Migrate your logs to AI-driven log analytics in record time by automating custom data integrations]]></description>
            <content:encoded><![CDATA[<p>Elastic is accelerating the adoption of <a href="https://www.elastic.co/observability/aiops">AI-driven log analytics</a> by automating the ingestion of custom logs, which is increasingly important as the deployment of GenAI-based applications grows. These custom data sources must be ingested, parsed, and indexed effortlessly, enabling broader visibility and more straightforward root cause analysis (RCA) without requiring effort from Site Reliability Engineers (SREs). Achieving visibility across an enterprise IT environment is inherently challenging for SREs due to constant growth and change, such as new applications, added systems, and infrastructure migrations to the cloud. Until now, the onboarding of custom data has been costly and complex for SREs. With automatic import, SREs can concentrate on deploying, optimizing, and improving applications.</p>
<p>Automatic Import uses generative AI to automate the development of custom data integrations, reducing the time required from several days to less than 10 minutes and significantly lowering the learning curve for onboarding data. Powered by the  <a href="https://www.elastic.co/platform">Elastic Search AI Platform</a>, it provides model-agnostic access to leverage large language models (LLMs) and grounds answers in proprietary data through <a href="https://www.elastic.co/search-labs/blog/retrieval-augmented-generation-rag">retrieval augmented generation (RAG)</a>. This capability is further enhanced by Elastic's expertise in enabling observability teams to utilize any type of data and the flexibility of its <a href="https://www.elastic.co/generative-ai/search-ai-lake">Search AI Lake</a>. Arriving at a crucial time when organizations face an explosion of applications and telemetry data, such as logs, Automatic Import streamlines the initial stages of data migration by simplifying data collection and normalization. It also addresses the challenges of building custom connectors, which can otherwise delay deployments, issue analysis, and impact customer experiences.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-new-int.png" alt="Create new integration" /></p>
<h2>Enhancing AI Powered Observability with Automatic Import</h2>
<p><a href="https://www.elastic.co/observability">Automatic Import</a> builds on Elastic Observability’s AI-driven log analytics innovations, such as <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-getting-started.html">anomaly detection</a>, <a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-aiops.html">log rate and pattern analysis</a>, and <a href="https://www.elastic.co/blog/introducing-elastic-ai-assistant">Elastic AI Assistant</a>, and further automates and simplifies SREs’ workflows. Automatic Import applies generative AI to automate the creation of custom data integrations, allowing SREs to focus on logs and other telemetry data. While Elastic provides <a href="https://www.elastic.co/integrations/data-integrations">400+ prebuilt data integrations</a>, Automatic Import allows SREs to extend integrations to fit their workflows and expand visibility into production environments.</p>
<p>In conjunction with automatic import, Elastic is introducing <a href="https://www.elastic.co/blog/ai-log-analytics-express-migration">Elastic Express Migration</a>, a commercial incentive program designed to overcome migration inertia from existing deployments and contracts, providing a faster adoption path for new customers. </p>
<p>Automatic Import leverages <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">Elastic Common Schema (ECS)</a> with public LLMs to process and analyze data in ECS format, which is also part of OpenTelemetry. Once the data is in, SREs can leverage Elastic’s RAG-based AI Assistant to solve root cause analysis (RCA) challenges in dynamic, complex environments.</p>
<h2>Configuring and using Automatic Import</h2>
<p>Automatic Import is available to everyone with an Enterprise license. Here is how it works:</p>
<ul>
<li>
<p>The user configures connectivity to an LLM and uploads sample data</p>
</li>
<li>
<p>Automatic Import then extrapolates what to expect from the data source. These log samples are paired with LLM prompts that have been honed by Elastic engineers to reliably produce conformant Elasticsearch ingest pipelines. </p>
</li>
<li>
<p>Automatic Import then iteratively builds, tests, and tweaks a custom ingest pipeline until it meets Elastic integration requirements.</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-arch.png" alt="Create new integration Architecture" />
<em>Automatic Import powered by the Elastic Search AI Platform</em></p>
<p>Within minutes, a validated custom integration is created that accurately maps raw data into ECS and custom fields, populates contextual information (such as <code>related.*</code> fields), and categorizes events.</p>
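<p>To make the mapping step more concrete, here is a minimal Python sketch of the kind of transformation such an integration performs. It is an illustration only: the raw log shape, the <code>to_ecs</code> helper, and the <code>myapp.query</code> custom field are hypothetical, and the ingest pipeline that Automatic Import actually generates is not shown in this post.</p>
<pre><code class="language-python">def to_ecs(raw):
    '''Map a hypothetical raw log record onto ECS-style dotted field names.'''
    doc = {
        '@timestamp': raw['time'],
        'log.level': raw['level'].lower(),
        'message': raw['msg'],
        'source.ip': raw['client'],
        'user.name': raw['user'],
        # Event categorization (value picked by hand for this example)
        'event.category': ['database'],
        # Anything without an ECS equivalent stays in a custom namespace
        'myapp.query': raw['query'],
    }
    # related.* fields collect identifiers so they can be searched in one place
    doc['related.ip'] = [doc['source.ip']]
    doc['related.user'] = [doc['user.name']]
    return doc


raw_log = {
    'time': '2024-09-04T10:15:00Z',
    'level': 'INFO',
    'msg': 'query completed',
    'client': '10.0.0.7',
    'user': 'alice',
    'query': 'MATCH (n) RETURN count(n)',
}
ecs_doc = to_ecs(raw_log)
print(ecs_doc['related.ip'], ecs_doc['event.category'])
</code></pre>
<p>The point of the sketch is the split between ECS fields, <code>related.*</code> context, and custom fields; in practice that split is what you review in the recommended mappings step of the workflow.</p>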
<p>Automatic Import currently supports Anthropic models via <a href="https://www.elastic.co/guide/en/kibana/8.15/bedrock-action-type.html">Elastic’s connector for Amazon Bedrock</a>, and additional LLMs will be introduced soon. It supports JSON and NDJSON-based log formats currently.</p>
<h3>Automatic Import workflow</h3>
<p>SREs constantly have to manage new tools and components that developers add to applications. Neo4j, for example, is a database that doesn’t have an integration in Elastic. The following steps walk you through how to create an integration for Neo4j with Automatic Import:</p>
<ol>
<li>Start by navigating to <code>Integrations</code> -&gt; <code>Create new integration</code>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-new-int.png" alt="Create new integration" /></p>
<ol start="2">
<li>Provide a name and description for the new data source.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-neo4j-setup.png" alt="Set up integration" /></p>
<ol start="3">
<li>Next, fill in other details and provide some sample data, anonymized as you see fit.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-pipline.png" alt="Set up pipeline" /></p>
<ol start="4">
<li>Click “Analyze logs” to submit integration details, sample logs, and expert-written instructions from Elastic to the specified LLM, which builds the integration package using generative AI. Automatic Import then fine-tunes the integration in an automated feedback loop until it is validated to meet Elastic requirements.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-analysis.png" alt="Analyze sample logs" /></p>
<ol start="5">
<li>Review what Automatic Import presents as recommended mappings to ECS fields and custom fields. You can easily adjust these settings if necessary.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-finished.png" alt="Review Analysis" /></p>
<ol start="6">
<li>After finalizing the integration, add it to Elastic Agent or view it in Kibana. It is now available alongside your other integrations and follows the same workflows as prebuilt integrations.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-success.png" alt="Creation complete" /></p>
<ol start="7">
<li>Upon deployment, you can begin analyzing newly ingested data immediately. Start by looking at the new Logs Explorer in Elastic Observability.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/auto-import-explorer.png" alt="Look at logs" /></p>
<h2>Accelerate log-analytics with automatic import</h2>
<p>Automatic Import lowers the time required to build and test custom data integrations from days to minutes, accelerating the switch to <a href="https://www.elastic.co/observability/aiops">AI-driven log analytics</a>. Elastic Observability pairs the unique power of Automatic Import with Elastic’s deep library of prebuilt data integrations, enabling wider visibility and fast data onboarding, along with AI-based features, such as the Elastic AI Assistant to accelerate RCA and reduce operational overhead.</p>
<p>Interested in our <a href="https://www.elastic.co/splunk-replacement">Express Migration</a> program to level up to Elastic? <a href="https://www.elastic.co/splunk-interest?elektra=organic&amp;storm=CLP&amp;rogue=splunkobs-gic">Contact Elastic</a> to learn more. </p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em> </p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-automatic-import-logs-genai/elastic-auto-importv2.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Connecting the Dots: ES|QL Joins for Richer Observability Insights]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-esql-join-observability</link>
            <guid isPermaLink="false">elastic-esql-join-observability</guid>
            <pubDate>Thu, 29 May 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Now in tech preview, ES|QL LOOKUP JOIN lets you enrich logs, metrics, and traces at query time, with no need to denormalize at ingest. Add deployment, infra, or business context dynamically, reduce storage, and accelerate root cause analysis in Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>Connecting the Dots: ES|QL Joins for Richer Observability Insights</h1>
<p>You might have seen our recent announcement about the <a href="https://www.elastic.co/blog/esql-lookup-join-elasticsearch">arrival of SQL-style joins in Elasticsearch</a> with ES|QL's LOOKUP JOIN command (now in Tech Preview!). While that post covered the basics, let's take a closer look at this in the context of Observability. How can this new join capability specifically help engineers and SREs make sense of their logs, metrics, and traces and make Elasticsearch more storage efficient by not denormalizing as much data?</p>
<p><strong>Note:</strong> Before we jump into the details, it’s important to mention again that this type of functionality today relies on a special lookup index. It is not (yet) possible to JOIN any arbitrary index.</p>
<p>Observability isn't just about collecting data; it's about understanding it. Often, the raw telemetry data – a log line, a metric point, a trace span – lacks the full context needed for quick diagnosis or impact assessment. We need to correlate data, enrich it with business or infrastructure context, and ask more advanced questions.</p>
<p>Historically, achieving this in Elasticsearch involved techniques like denormalizing data at ingest time (using ingest pipelines with enrich processors, for example) or performing joins client-side. </p>
<p>By adding the necessary context (like host details or user attributes) as data flowed in, each document arrived fully ready for queries and analytics without extra processing later on. This approach worked well in many cases and still does, particularly when the reference data changes slowly or when the enriched fields are critical for nearly every search. </p>
<p>However, as environments become more dynamic and diverse, the need to frequently update reference data (or avoid storing repetitive fields in every document) highlighted some of the trade-offs. </p>
<p>With the introduction of ES|QL LOOKUP JOIN in Elasticsearch 8.18 and 9.0, you now have an additional, more flexible option for situations where real-time lookups and minimal duplication are desired. Both methods—ingest-time enrichment and on-the-fly LOOKUP JOIN—complement each other and remain valid, depending on use case needs around update frequency, query performance, and storage considerations.</p>
<h2>Why Lookup Joins for Observability</h2>
<p>Lookup joins keep things flexible. You can decide on the fly if you’d like to look up additional information to assist you in your investigation.</p>
<p>Here are some examples:</p>
<ul>
<li>
<p><strong>Deployment Information:</strong> Which version of the code is generating these errors?</p>
</li>
<li>
<p><strong>Infrastructure Mapping:</strong> Which Kubernetes cluster or cloud region is experiencing high latency? What hardware does it use?</p>
</li>
<li>
<p><strong>Business Context:</strong> Are critical customers being affected by this slowdown?</p>
</li>
<li>
<p><strong>Team Ownership:</strong> Which team owns the service throwing these exceptions?</p>
</li>
</ul>
<p>Keeping this kind of information perfectly denormalized onto <em>every single</em> log line or metric point can be challenging and inefficient. Lookup datasets – like lists of deployments, server inventories, customer tiers, or service ownership mappings – often change independently of the telemetry data itself.</p>
<p><code>LOOKUP JOIN</code> is ideal here because:</p>
<ol>
<li>
<p><strong>Lookup Indices are Writable:</strong> Update your deployment list, CMDB export, or on-call rotation in the lookup index, and your <em>next</em> ES|QL query immediately uses the fresh data. No need to re-run complex enrich policies or re-index data.</p>
</li>
<li>
<p><strong>Flexibility:</strong> You decide <em>at query time</em> which context to join. Maybe today you care about deployment versions, tomorrow about cloud regions.</p>
</li>
<li>
<p><strong>Simpler Setup:</strong> As the original post highlighted, there are no enrich policies to manage. Just create an index with <code>index.mode: lookup</code> and load your data - up to 2 billion documents per lookup index.</p>
</li>
</ol>
<h2>Observability Use Cases &amp; Examples with ES|QL</h2>
<p>Let’s now look at a few examples to see how Lookup Joins can help.</p>
<h3>Enriching Error Logs with Deployment Context</h3>
<p>Let's say you're seeing a spike in errors for your <code>opbeans-ruby</code> service. You have logs flowing into a data stream, but they only contain the service name. The documents don’t have any information about the deployment activity itself.</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
</code></pre>
<p>You need to know if a recent deployment is contributing to these errors. To do this, we can maintain a <code>deployments_info_lkp</code> index (set with <code>index.mode: lookup</code>) that maps service names to their deployment times. This index could be updated from our CI/CD pipeline automatically any time a deployment happens.</p>
<pre><code class="language-bash">PUT /deployments_info_lkp
{
  &quot;settings&quot;: {
    &quot;index.mode&quot;: &quot;lookup&quot;
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;service&quot;: {
        &quot;properties&quot;: {
          &quot;name&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          },
          &quot;version&quot;: {
            &quot;type&quot;: &quot;keyword&quot;
          }
        }
      },
      &quot;deployment_time&quot;: {
        &quot;type&quot;: &quot;date&quot;
      }
    }
  }
}
# Bulk index the deployment documents
POST /_bulk
{ &quot;index&quot; : { &quot;_index&quot; : &quot;deployments_info_lkp&quot; } }
{ &quot;service.name&quot;: &quot;opbeans-ruby&quot;, &quot;service.version&quot;: &quot;1.0&quot;, &quot;deployment_time&quot;: &quot;2025-05-22T06:00:00Z&quot; }
{ &quot;index&quot; : { &quot;_index&quot; : &quot;deployments_info_lkp&quot; } }
{ &quot;service.name&quot;: &quot;opbeans-go&quot;, &quot;service.version&quot;: &quot;1.1.0&quot;, &quot;deployment_time&quot;: &quot;2025-05-22T06:00:00Z&quot; }
</code></pre>
<p>Using this information you can now write a query that joins these two sources.</p>
<p><em>ES|QL Query:</em></p>
<pre><code class="language-bash">FROM logs-* 
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
  | LOOKUP JOIN deployments_info_lkp ON service.name 
</code></pre>
<p>This alone is a good step towards troubleshooting the problem. You now have the deployment_time column available for each of your error documents. The last remaining step now is to use this for further filtering. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/discover.png" alt="Discover" /></p>
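<p>If it helps to reason about what the join produces, the following Python sketch mimics the behavior on in-memory rows. It is a simplified model, not how Elasticsearch implements the command: the <code>lookup_join</code> helper is hypothetical, it assumes at most one lookup document per key, and it fills the lookup columns with <code>None</code> when no match exists.</p>
<pre><code class="language-python">def lookup_join(rows, lookup_rows, key):
    '''Left-join each row with the lookup row that shares the same key value.'''
    by_key = {entry[key]: entry for entry in lookup_rows}
    # Every non-key column of the lookup data becomes a column of the result
    columns = sorted({name for entry in lookup_rows for name in entry if name != key})
    joined = []
    for row in rows:
        match = by_key.get(row.get(key), {})
        enriched = dict(row)
        for name in columns:
            # Unmatched rows are kept; their lookup columns are simply empty
            enriched[name] = match.get(name)
        joined.append(enriched)
    return joined


errors = [
    {'message': 'payment failed', 'service.name': 'opbeans-ruby'},
    {'message': 'timeout', 'service.name': 'opbeans-python'},
]
deployments = [
    {'service.name': 'opbeans-ruby', 'service.version': '1.0',
     'deployment_time': '2025-05-22T06:00:00Z'},
]
result = lookup_join(errors, deployments, 'service.name')
print(result[0]['deployment_time'])  # 2025-05-22T06:00:00Z
print(result[1]['deployment_time'])  # None
</code></pre>
<p>Every error row survives the join; only rows whose <code>service.name</code> exists in the lookup data gain a deployment time, which is why a later filter on <code>deployment_time</code> narrows the result to services with a known deployment.</p>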
<p>Any of the data we managed to join from the lookup index can be handled as any other data we’d usually have available in the ES|QL query. This means that we can filter on it, and check if we had a recent deployment.</p>
<pre><code class="language-bash">FROM logs-*
  | WHERE log.level == &quot;error&quot;
  | WHERE service.name == &quot;opbeans-ruby&quot;
  | LOOKUP JOIN deployments_info_lkp ON service.name 
  | KEEP message, service.name, service.version, deployment_time 
  | WHERE deployment_time &gt; NOW() - 2h
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/discover2.png" alt="Discover2" /></p>
<h3>Saving disk space using JOIN</h3>
<p>Denormalizing data by including contextual information like host OS or cloud provider details directly in every log event is convenient for querying but can increase storage consumption, especially with high-volume data streams. Instead of storing this often-redundant information repeatedly, we can leverage joins to retrieve it on demand, potentially saving valuable disk space. While compression often handles repetitive data well, removing these fields entirely can still yield noticeable storage savings.</p>
<p>In this example we’ll use a dataset of 1,000,000 Kubernetes container logs using the default mapping of the Kubernetes integration, with <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/logs-data-stream">logsdb index mode</a> enabled. The starting size for this index is 35.5mb. </p>
<pre><code class="language-bash">GET _cat/indices/k8s-logs-default?h=index,pri.store.size
### 
k8s-logs-default       35.5mb
</code></pre>
<p>Using the <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-disk-usage">disk usage API</a>, we observed that fields like host.os and cloud.* contribute roughly 5% to the total index size on disk (35.5mb). These fields can be useful in some cases, but information like the os.name is rarely queried. </p>
<pre><code class="language-bash">// Example host.os structure
&quot;os&quot;: {
  &quot;codename&quot;: &quot;Plow&quot;, &quot;family&quot;: &quot;redhat&quot;, &quot;kernel&quot;: &quot;6.6.56+&quot;,
  &quot;name&quot;: &quot;Red Hat Enterprise Linux&quot;, &quot;platform&quot;: &quot;rhel&quot;, &quot;type&quot;: &quot;linux&quot;, &quot;version&quot;: &quot;9.5 (Plow)&quot;
}

// Example cloud structure
&quot;cloud&quot;: {
  &quot;account&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; },
  &quot;availability_zone&quot;: &quot;us-central1-c&quot;,
  &quot;instance&quot;: { &quot;id&quot;: &quot;5799032384800802653&quot;, &quot;name&quot;: &quot;gke-edge-oblt-edge-oblt-pool-46262cd0-w905&quot; },
  &quot;machine&quot;: { &quot;type&quot;: &quot;e2-standard-4&quot; },
  &quot;project&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; },
  &quot;provider&quot;: &quot;gcp&quot;, &quot;region&quot;: &quot;us-central1&quot;, &quot;service&quot;: { &quot;name&quot;: &quot;GCE&quot; }
}
</code></pre>
<p>Instead of storing this information with every document, let's instead drop this information in an ingest pipeline.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/drop-host-os-cloud
{
  &quot;processors&quot;: [
      { &quot;remove&quot;: { &quot;field&quot;: &quot;host.os&quot; } },
      { &quot;set&quot;: { &quot;field&quot;: &quot;tmp1&quot;, &quot;value&quot;: &quot;{{cloud.instance.id}}&quot; } }, // Temporarily store the ID
      { &quot;remove&quot;: { &quot;field&quot;: &quot;cloud&quot; } },                             // Remove the entire cloud object
      { &quot;set&quot;: { &quot;field&quot;: &quot;cloud.instance.id&quot;, &quot;value&quot;: &quot;{{tmp1}}&quot; } }, // Restore just the cloud instance ID
      { &quot;remove&quot;: { &quot;field&quot;: &quot;tmp1&quot;, &quot;ignore_missing&quot;: true } }         // Clean up temporary field
    ]
}
</code></pre>
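<p>The effect of these processors on a single document can be sketched in plain Python. This is only a model of the pipeline's outcome: the <code>drop_host_os_and_cloud</code> helper and the sample document are hypothetical, and unlike the <code>remove</code> processor above, the sketch silently tolerates documents where the fields are missing.</p>
<pre><code class="language-python">import copy


def drop_host_os_and_cloud(doc):
    '''Drop host.os and keep only cloud.instance.id, like the pipeline above.'''
    slim = copy.deepcopy(doc)
    slim.get('host', {}).pop('os', None)
    # Remember the join key before the cloud object is removed
    instance_id = slim.get('cloud', {}).get('instance', {}).get('id')
    slim.pop('cloud', None)
    if instance_id is not None:
        slim['cloud'] = {'instance': {'id': instance_id}}
    return slim


log = {
    'message': 'container started',
    'host': {'name': 'node-1', 'os': {'family': 'redhat', 'type': 'linux'}},
    'cloud': {
        'provider': 'gcp',
        'region': 'us-central1',
        'instance': {'id': '5799032384800802653',
                     'name': 'gke-edge-oblt-edge-oblt-pool-46262cd0-w905'},
    },
}
slim_log = drop_host_os_and_cloud(log)
print(slim_log)
</code></pre>
<p>The only cloud field left in the slimmed document is <code>cloud.instance.id</code>, which is the key needed to join the removed metadata back in at query time.</p>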
<p>Reindexing (and <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-forcemerge">force merging to one segment</a>) now shows the following size, resulting in approximately 5% less space. </p>
<pre><code class="language-bash">GET _cat/indices/k8s-logs-*?h=index,pri.store.size
### 
k8s-logs-default             35.5mb
k8s-logs-drop-cloud-os       33.7mb
</code></pre>
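<p>The roughly 5% figure follows directly from the two index sizes, 35.5mb before and 33.7mb after dropping the fields. A quick check in Python (the variable names are just for illustration):</p>
<pre><code class="language-python">before_mb = 35.5  # index size with host.os and cloud.* stored in every document
after_mb = 33.7   # index size after the ingest pipeline removed those fields

saved_mb = before_mb - after_mb
saved_pct = saved_mb / before_mb * 100
print(f'{saved_mb:.1f}mb saved ({saved_pct:.1f}%)')  # prints 1.8mb saved (5.1%)
</code></pre>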
<p>Now, to regain access to the removed host.os and cloud.* information during analysis without storing it in every log document, we can create a lookup index. This index will store the full host and cloud metadata, keyed by the cloud.instance.id that we preserved in our logs. This instance_metadata_lkp index will be significantly smaller than the space saved across millions or billions of log lines, as it only needs one document per unique instance.</p>
<pre><code class="language-bash"># Create the lookup index for instance metadata
PUT /instance_metadata_lkp
{
  &quot;settings&quot;: {
    &quot;index.mode&quot;: &quot;lookup&quot;
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;cloud.instance.id&quot;: {  # The join key we kept in the logs
        &quot;type&quot;: &quot;keyword&quot;
      },
      &quot;host.os&quot;: {           # The full host.os object we removed
        &quot;type&quot;: &quot;object&quot;,
        &quot;enabled&quot;: false      # Often don't need to search sub-fields here
      },
      &quot;cloud&quot;: {             # The full cloud object we removed (mostly)
         &quot;type&quot;: &quot;object&quot;,
         &quot;enabled&quot;: false     # Often don't need to search sub-fields here
      }
    }
  }
}

# Bulk index sample instance metadata (keyed by cloud.instance.id)
# This data might come from your cloud provider API or CMDB
POST /_bulk
{ &quot;index&quot; : { &quot;_index&quot; : &quot;instance_metadata_lkp&quot;, &quot;_id&quot;: &quot;5799032384800802653&quot; } }
{ &quot;cloud.instance.id&quot;: &quot;5799032384800802653&quot;, &quot;host.os&quot;: { &quot;codename&quot;: &quot;Plow&quot;, &quot;family&quot;: &quot;redhat&quot;, &quot;kernel&quot;: &quot;6.6.56+&quot;, &quot;name&quot;: &quot;Red Hat Enterprise Linux&quot;, &quot;platform&quot;: &quot;rhel&quot;, &quot;type&quot;: &quot;linux&quot;, &quot;version&quot;: &quot;9.5 (Plow)&quot; }, &quot;cloud&quot;: { &quot;account&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; }, &quot;availability_zone&quot;: &quot;us-central1-c&quot;, &quot;instance&quot;: { &quot;id&quot;: &quot;5799032384800802653&quot;, &quot;name&quot;: &quot;gke-edge-oblt-edge-oblt-pool-46262cd0-w905&quot; }, &quot;machine&quot;: { &quot;type&quot;: &quot;e2-standard-4&quot; }, &quot;project&quot;: { &quot;id&quot;: &quot;elastic-observability&quot; }, &quot;provider&quot;: &quot;gcp&quot;, &quot;region&quot;: &quot;us-central1&quot;, &quot;service&quot;: { &quot;name&quot;: &quot;GCE&quot; } } }
</code></pre>
<p>With this setup, when you need the full host or cloud context for your logs, you can simply use LOOKUP JOIN in your ES|QL query and continue filtering on the data from the lookup index:</p>
<pre><code class="language-bash">FROM logs-* 
  | LOOKUP JOIN instance_metadata_lkp ON cloud.instance.id 
  | WHERE cloud.region == &quot;us-central1&quot;
</code></pre>
<p>This approach allows us to query the full context when needed (e.g., filtering logs by host.os.name or cloud.region) while significantly reducing the storage footprint of the high-volume log indices by avoiding redundant data denormalization.</p>
<p>Note that low-cardinality metadata fields generally compress well, and a large part of the storage savings in this case comes from the “text” mapping of the host.os.name and cloud.instance.name fields. Make sure to use the disk usage API to evaluate whether this approach is worth it in your specific use case.</p>
<h2>Getting Started with Lookups for Observability</h2>
<p>Creating the necessary lookup indices is straightforward. As detailed in our <a href="https://www.elastic.co/blog/esql-lookup-join-elasticsearch">initial blog post</a>, you can use Kibana's Index Management UI, the Create Index API, or the File Upload utility – the key is setting <code>&quot;index.mode&quot;: &quot;lookup&quot;</code> in the index settings.</p>
<p>For Observability, consider automating the population of these lookup indices:</p>
<ul>
<li>
<p>Export data periodically from your CMDB, CRM, or HR systems.</p>
</li>
<li>
<p>Have your CI/CD pipeline update the <code>deployments_info_lkp</code> index upon successful deployment.</p>
</li>
<li>
<p>Use tools like Logstash with an <code>elasticsearch</code> output configured to write to your lookup index.</p>
</li>
</ul>
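<p>As one way to approach the CI/CD option, the sketch below builds the NDJSON body for a <code>_bulk</code> request from a list of deployment records. The <code>build_bulk_body</code> helper and the sample record are hypothetical; using the service name as the document <code>_id</code> is an assumption made here so that each new deployment overwrites the previous entry and the join returns one row per service.</p>
<pre><code class="language-python">import json


def build_bulk_body(index, deployments):
    '''Return an NDJSON _bulk body: one action line and one document line per entry.'''
    lines = []
    for deployment in deployments:
        # Keying by service name keeps a single, latest document per service
        action = {'index': {'_index': index, '_id': deployment['service.name']}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(deployment))
    # The bulk API requires the body to end with a newline
    return '\n'.join(lines) + '\n'


body = build_bulk_body('deployments_info_lkp', [
    {'service.name': 'opbeans-ruby', 'service.version': '1.0.1',
     'deployment_time': '2025-05-22T08:30:00Z'},
])
print(body, end='')
</code></pre>
<p>A pipeline step would then send this body to <code>POST /_bulk</code> with the <code>Content-Type: application/x-ndjson</code> header, using whichever HTTP client and authentication your environment already relies on.</p>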
<h2>A Note on Performance and Alternatives</h2>
<p>While incredibly powerful, joins aren't free. Each <code>LOOKUP JOIN</code> adds processing overhead to your query. For contextual data that is <em>very</em> static (e.g., the cloud region a host <em>permanently</em> resides in) and needed in <em>almost every</em> query against that data, the traditional approach of enriching at ingest time might still be slightly more performant for those specific queries, trading upfront processing and storage for query speed.</p>
<p>However, for the dynamic, flexible, and targeted enrichment scenarios common in Observability – like mapping to ever-changing deployments, user segments, or team structures – <code>LOOKUP JOIN</code> offers a compelling, efficient, and easier-to-manage solution.</p>
<h2>Conclusion</h2>
<p>ES|QL's <code>LOOKUP JOIN</code> makes it easy to correlate and enrich your logs, metrics, and traces with up-to-date external information <em>at query time</em>, so you can move faster from detecting problems to understanding their scope, impact, and root cause.</p>
<p>This feature is currently in Technical Preview in Elasticsearch 8.18 and Serverless, available now on Elastic Cloud. We encourage you to try it out with your own Observability data and share your feedback using the &quot;Submit feedback&quot; button in the ES|QL editor in Discover. We're excited to see how you use it to connect the dots in your systems!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-esql-join-observability/esql-join.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[From raw logs to system knowledge: the AI context layer observability is missing]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-knowledge-indicators-log-extraction</link>
            <guid isPermaLink="false">elastic-knowledge-indicators-log-extraction</guid>
            <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A self-updating knowledge base built from your logs: services, dependencies, and failure modes, so your AI agents always know what they are looking at.]]></description>
            <content:encoded><![CDATA[<p>Your monitoring system sees everything, but understands almost nothing.</p>
<p>Before you can rely on most tools to trigger a meaningful alert, you have to do the heavy lifting of telling them exactly what to watch. You have to write the rules, specify what a &quot;normal&quot; baseline looks like, and manually define your service catalog. We're working to change that dynamic at Elastic, and the first major building block is now in place: a system designed to simply read your logs and figure out what's inside them on its own.</p>
<p>Consider what happens when an alert fires today. Your on-call engineer opens an investigation, and the first few minutes are inevitably burned reconstructing basic facts. They have to figure out which services are involved, how those services connect to one another, what error patterns are typical, and which queries they actually need to run to dig deeper. An AI agent faces this exact same cold start problem. Without prior knowledge of your system's architecture, an agent has to read through hundreds of log lines just to establish baseline context that really should already be available.</p>
<p>This blank slate is the default state of most observability setups. You only know what you've explicitly configured. When new services spin up and start writing logs, they sit there without rules until someone takes the time to write them. When architectural dependencies shift, your topology map quietly goes stale unless you've done an exceptional job instrumenting all your services. If an error pattern fires every day but nobody wrote a specific rule to catch it, it remains invisible.</p>
<p>Knowledge Indicators (KI) are our way of closing this gap. When you run extraction against a log stream, Elastic analyzes the raw data and returns structured facts about your environment. It identifies which services are running, the underlying infrastructure they rely on, how they depend on each other, and the log schemas they're using. It even generates a set of ES|QL queries for conditions that might be worth alerting on. Rather than a static configuration, this knowledge accumulates over time, automatically expires when a service disappears, and feeds directly into downstream capabilities like Rules, topology maps, AI agent investigations, and dashboards.</p>
<div align="center">
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-knowledge-indicators-log-extraction/topology-graph@2x.png" alt="Knowledge Indicators graph: service nodes (claim-intake, fraud-check, policy-lookup, payment-processor, kafka, notification-dispatch, kubernetes) connected by labeled dependency edges (connection refused, ECONNREFUSED, gRPC UNAVAILABLE, pool exhausted, pod sync)" /></p>
<p><em>Topology graph generated from dependency KIs: service nodes, dependency edges, and detected error conditions.</em></p>
</div>
<h2>The Extraction Pipeline</h2>
<p>When designing this system, our primary goal was to eliminate the need for prior context. There should be no mandatory schemas, no service catalogs tied to specific properties, and no predefined static assets that would need to be maintained. We asked ourselves a simple question: if you handed a sample of raw logs to an engineer who had never seen the system before, what could they deduce just by looking?</p>
<p>That thought experiment became our core approach. The system samples a small batch of logs from a stream, processes them through a combination of LLM analysis and deterministic code generators, and accumulates its findings across multiple rounds, entirely configuration-free.</p>
<p>Imagine hiring a room full of junior SREs with one specific job: read these log lines and report their observations, not to fix anything or trigger alarms, just to notice things. &quot;This looks like an nginx server,&quot; or &quot;This database is PostgreSQL,&quot; or &quot;Service A is calling Service B over HTTP.&quot; That's essentially what our extraction job is doing continuously across your streams.</p>
<p>To see how this works in practice, take a look at this single line from an nginx access log:</p>
<pre><code>192.168.1.45 - - [31/Mar/2026:14:23:01 +0000] &quot;POST /api/v2/claims HTTP/1.1&quot; 200 1247 &quot;-&quot; &quot;claim-intake/1.4.2&quot;
</code></pre>
<p>From just this string, the pipeline extracts four distinct facts:</p>
<ul>
<li><strong>Entity</strong>: <code>claim-intake</code> (identifiable as a service from the User-Agent)</li>
<li><strong>Version</strong>: <code>1.4.2</code> (extracted from the User-Agent string)</li>
<li><strong>Technology</strong>: nginx (the web server fielding the request)</li>
<li><strong>Schema</strong>: Combined Log Format</li>
</ul>
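<p>To make this concrete, here is a minimal Python sketch that pulls the entity, version, and schema out of that line with a regular expression. It is an illustration only: the pipeline described in this post relies on LLM analysis and deterministic generators, not on this code. The function name <code>extract_facts</code>, the returned keys, and the rule that a name/version User-Agent identifies a calling service are all assumptions made for the example.</p>

```python
import re

# Combined Log Format: host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)


def extract_facts(line):
    """Return a dict of facts for a Combined Log Format line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    user_agent = match.group(8)
    # Assumption for this sketch: a "name/version" User-Agent names the calling service.
    name, _, version = user_agent.partition("/")
    return {
        "schema": "Combined Log Format",
        "entity": name,
        "version": version or None,
        "method": match.group(3),
        "path": match.group(4),
        "status": int(match.group(5)),
    }


line = (
    '192.168.1.45 - - [31/Mar/2026:14:23:01 +0000] '
    '"POST /api/v2/claims HTTP/1.1" 200 1247 "-" "claim-intake/1.4.2"'
)
facts = extract_facts(line)
print(facts["entity"], facts["version"], facts["schema"])
```

<p>A fixed pattern like this only works when the format is known in advance, which is exactly the prior knowledge the extraction pipeline is designed to avoid needing.</p>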
<p>Similarly, consider this Java service log:</p>
<pre><code>2026-03-31T14:23:03.412Z INFO fraud-check --- [nio-8080-exec-3] c.e.FraudCheckService : Calling upstream POST http://policy-lookup:8081/v1/policy latency=142ms status=200
</code></pre>
<p>Here, the extraction identifies:</p>
<ul>
<li><strong>Entity</strong>: <code>fraud-check</code> (a Spring Boot service)</li>
<li><strong>Dependency</strong>: <code>fraud-check</code> → <code>policy-lookup</code> (via an outbound HTTP call)</li>
<li><strong>Technology</strong>: Java, Spring Boot</li>
</ul>
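<p>The dependency in this line can be sketched the same way. The snippet below is a hypothetical, rule-based stand-in for what the extraction infers: it assumes the exact layout of the sample line (timestamp, level, service name, thread, logger, message) and treats any HTTP verb followed by a URL in the message as an outbound call. The name <code>extract_dependency</code> is invented for the example.</p>

```python
import re
from urllib.parse import urlsplit

# Layout of the sample line: timestamp LEVEL service --- [thread] logger : message
SERVICE_LINE = re.compile(r"^(\S+) (\w+) (\S+) --- \[([^\]]+)\] (\S+) : (.*)$")
# An outbound HTTP call such as "POST http://policy-lookup:8081/v1/policy"
OUTBOUND_CALL = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE) (https?://\S+)")


def extract_dependency(line):
    """Return (source, target, protocol) for an outbound HTTP call, or None."""
    parsed = SERVICE_LINE.match(line)
    if parsed is None:
        return None
    call = OUTBOUND_CALL.search(parsed.group(6))
    if call is None:
        return None
    url = urlsplit(call.group(2))
    # The source is the logging service; the target is the host being called.
    return (parsed.group(3), url.hostname, url.scheme)


line = (
    "2026-03-31T14:23:03.412Z INFO fraud-check --- [nio-8080-exec-3] "
    "c.e.FraudCheckService : Calling upstream POST "
    "http://policy-lookup:8081/v1/policy latency=142ms status=200"
)
print(extract_dependency(line))
```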
<p>Pull twenty lines like these from across your stream, and you quickly build a working, accurate picture of your system architecture.</p>
<p>To ensure this process never blocks ingestion, extraction runs entirely as a background task. You can trigger it on demand from the stream detail view or the Significant Events Discovery UI, but the goal is to have it running by default without requiring attention.</p>
<p>The pipeline itself runs multiple iterations, each time fetching a small sample of documents. We use a mix of random and already-excluded documents to ensure we discover the full scope of the system. KIs found in one iteration are fed back as exclusions into the next, so each round focuses on what the previous one missed—ensuring quieter, less-represented services aren't crowded out by noisier ones.</p>
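<p>The feedback loop can be sketched in a few lines of Python. This is a simplified model, not the actual sampler: it assumes each document carries a <code>service</code> field, uses a set lookup as a stand-in for the LLM pass, and samples only from documents that are not yet explained, whereas the real pipeline mixes several sampling pools. The function name and its parameters are hypothetical.</p>

```python
import random


def run_extraction(documents, rounds=3, sample_size=20, seed=0):
    """Sample in rounds, feeding what was found back in as exclusions."""
    rng = random.Random(seed)
    known = set()
    for _ in range(rounds):
        # Exclude documents already explained by a previous round.
        candidates = [doc for doc in documents if doc["service"] not in known]
        if not candidates:
            break
        sample = rng.sample(candidates, min(sample_size, len(candidates)))
        # Stand-in for the LLM pass: record which services the sample reveals.
        known |= {doc["service"] for doc in sample}
    return known


# One noisy service and two quiet ones.
docs = (
    [{"service": "checkout"}] * 1000
    + [{"service": "billing"}] * 3
    + [{"service": "audit"}] * 2
)
print(sorted(run_extraction(docs, rounds=2)))
```

<p>With a single round, the 1,000 <code>checkout</code> documents dominate the sample. Once <code>checkout</code> is excluded, the second round draws only from the five remaining documents, so the quieter services are found.</p>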
<div align="center">
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-knowledge-indicators-log-extraction/ki-extraction-pipeline@2x.png" alt="KI extraction pipeline: raw logs → three-pool biased sampling (entity-filtered / diverse / random) → LLM finalize_features and 4 computed generators in parallel → merge and dedup → 84 KIs stored" /></p>
<p><em>Extraction pipeline: biased document sampling feeds a parallel LLM pass and four deterministic generators. Results are merged and deduplicated before storage.</em></p>
</div>
<p>Once sampled, the documents are sent to an LLM. We use a system prompt that instructs the model to identify a few specific types of features, which we plan to extend over time:</p>
<table>
<thead>
<tr>
<th>Type</th>
<th>What it captures</th>
</tr>
</thead>
<tbody>
<tr>
<td>Entity</td>
<td>Distinct system components: services, applications, jobs</td>
</tr>
<tr>
<td>Infrastructure</td>
<td>Environment context: Kubernetes, cloud provider, OS</td>
</tr>
<tr>
<td>Technology</td>
<td>Languages, frameworks, libraries, databases</td>
</tr>
<tr>
<td>Dependency</td>
<td>Relationships between components</td>
</tr>
<tr>
<td>Schema</td>
<td>Log format conventions: ECS, OTel, custom</td>
</tr>
</tbody>
</table>
<p>The LLM returns its findings, delivering newly identified traits alongside any intentionally ignored ones (like user-excluded false positives). To be accepted, every feature must include stable identifying properties and cite direct evidence from the sampled logs. The LLM also assigns a confidence score from 0–100 for each KI, so any downstream use of that KI knows how much to trust it.</p>
<p>In parallel, a set of deterministic code-based generators independently analyze the data to produce statistical summaries, log samples, pattern clusters, and error-specific features. Because these are computed rather than inferred, they always receive a confidence score of 100.</p>
<p>Finally, the LLM results and computed features are merged and deduplicated. Known KIs reuse their existing UUIDs, new discoveries get fresh ones, and any user-excluded features are quietly dropped server-side. Surviving KIs are saved with an active status and an expiration date set for seven days out.</p>
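<p>A minimal sketch of that merge step, under stated assumptions: a KI's identity is taken to be its type plus its stable properties, the store is an in-memory dict, and the names <code>stable_key</code> and <code>merge_kis</code> are hypothetical. The real implementation is not shown in this post and may differ.</p>

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=7)  # expiry window described in the text


def stable_key(ki):
    """Identity of a KI: its type plus its properties, independent of key order."""
    return (ki["type"], json.dumps(ki["properties"], sort_keys=True))


def merge_kis(stored, discovered, excluded_keys, now):
    """Merge discovered KIs into the stored ones (a dict keyed by stable_key)."""
    merged = dict(stored)
    for ki in discovered:
        key = stable_key(ki)
        if key in excluded_keys:
            continue  # the user marked this one as a false positive
        previous = merged.get(key)
        merged[key] = {
            **ki,
            # Known KIs keep their UUID; new discoveries get a fresh one.
            "uuid": previous["uuid"] if previous else str(uuid.uuid4()),
            "status": "active",
            "expires_at": (now + TTL).isoformat(),
        }
    return merged


now = datetime(2026, 4, 2, tzinfo=timezone.utc)
dep = {
    "type": "dependency",
    "properties": {"source": "api_gateway", "target": "inference_service", "protocol": "http"},
}
stored = {stable_key(dep): {**dep, "uuid": "existing-uuid"}}
merged = merge_kis(stored, [dep], set(), now)
print(merged[stable_key(dep)]["uuid"], merged[stable_key(dep)]["expires_at"])
```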
<div align="center">
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-knowledge-indicators-log-extraction/kis.png" alt="Knowledge Indicators tab showing 84 KIs across streams, with type, confidence (1–5 stars), and stream columns visible" /></p>
<p><em>Knowledge Indicators tab showing 84 KIs across streams, with type, confidence (1–5 stars), and stream columns.</em></p>
</div>
<h2>What a Knowledge Indicator Contains</h2>
<p>Knowledge Indicators fall into two categories: Feature KIs and Query KIs.</p>
<p>Feature KIs are descriptive. They explain the contents of the stream: what services are running, the infrastructure housing them, their dependencies, and the active tech stack.</p>
<p>Query KIs are actionable. They are ready-to-run ES|QL queries targeting notable conditions like connection exhaustion, out-of-memory errors, or fatal exceptions. Each comes with a severity score from 0 to 100, and when promoted to Rules, they fire Events.</p>
<p>Feature KIs carry a full data model:</p>
<ul>
<li><strong><code>type</code> / <code>subtype</code></strong>: the category of the fact (Entity, Infrastructure, Technology, Dependency, Schema)</li>
<li><strong><code>title</code> / <code>description</code></strong>: a human-readable summary</li>
<li><strong><code>properties</code></strong>: stable key-value pairs used to deduplicate findings across multiple runs</li>
<li><strong><code>confidence</code></strong>: 0–100. LLM-identified KIs score based on evidence quality. Deterministic KIs always score 100.</li>
<li><strong><code>evidence</code></strong>: 2–5 supporting log excerpts that justify the KI's existence</li>
<li><strong><code>filter</code></strong>: an optional StreamLang condition scoping the KI to specific documents</li>
</ul>
<p>A dependency KI looks like this:</p>
<pre><code class="language-json">{
  &quot;type&quot;: &quot;dependency&quot;,
  &quot;subtype&quot;: &quot;service_dependency&quot;,
  &quot;title&quot;: &quot;api_gateway → inference_service&quot;,
  &quot;description&quot;: &quot;Service-to-service HTTP dependency from api_gateway to inference_service, observed in request logs&quot;,
  &quot;properties&quot;: {
    &quot;source&quot;: &quot;api_gateway&quot;,
    &quot;target&quot;: &quot;inference_service&quot;,
    &quot;protocol&quot;: &quot;http&quot;
  },
  &quot;confidence&quot;: 85,
  &quot;evidence&quot;: [
    &quot;service.name=api_gateway http.url=/v1/inference peer.service=inference_service&quot;,
    &quot;upstream=inference_service:8080 request=POST /v1/inference 200&quot;
  ],
  &quot;filter&quot;: { &quot;field&quot;: &quot;service.name&quot;, &quot;eq&quot;: &quot;api_gateway&quot; },
  &quot;status&quot;: &quot;active&quot;,
  &quot;expires_at&quot;: &quot;2026-04-09T00:00:00Z&quot;
}
</code></pre>
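<p>The constraints in the data model above lend themselves to a simple check. The following Python sketch validates a Feature KI dict against them. It is hypothetical: the function name is invented, and the lowercase type names are an assumption extrapolated from the <code>dependency</code> value in the example.</p>

```python
# Lowercase spellings are assumed from the "dependency" value in the example KI.
ALLOWED_TYPES = {"entity", "infrastructure", "technology", "dependency", "schema"}


def validate_feature_ki(ki):
    """Return a list of problems found in a Feature KI dict (empty if none)."""
    problems = []
    if ki.get("type") not in ALLOWED_TYPES:
        problems.append("unknown type")
    if not ki.get("properties"):
        problems.append("missing stable properties")
    confidence = ki.get("confidence")
    if not isinstance(confidence, int) or not 0 <= confidence <= 100:
        problems.append("confidence must be an integer from 0 to 100")
    if not 2 <= len(ki.get("evidence", [])) <= 5:
        problems.append("evidence must contain 2 to 5 log excerpts")
    return problems


example = {
    "type": "dependency",
    "properties": {"source": "api_gateway", "target": "inference_service", "protocol": "http"},
    "confidence": 85,
    "evidence": [
        "service.name=api_gateway http.url=/v1/inference peer.service=inference_service",
        "upstream=inference_service:8080 request=POST /v1/inference 200",
    ],
}
print(validate_feature_ki(example))
```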
<p>Query KIs take a simpler shape: a title, a description, a severity score, and the executable query:</p>
<pre><code class="language-json">{
  &quot;kind&quot;: &quot;query&quot;,
  &quot;title&quot;: &quot;PostgreSQL connection slot exhaustion&quot;,
  &quot;description&quot;: &quot;Fires when Postgres runs out of available connection slots&quot;,
  &quot;severity_score&quot;: 90,
  &quot;esql&quot;: {
    &quot;query&quot;: &quot;FROM logs-* | WHERE service.name == \&quot;postgres\&quot; AND message : \&quot;remaining connection slots\&quot;&quot;
  }
}
</code></pre>
<p>The <code>properties</code> field is what keeps Feature KIs stable across multiple pipeline runs. The dependency KI for <code>api_gateway → inference_service</code> records the source, target, and protocol as fixed pairs. The next time extraction runs, Elastic recognizes this existing relationship and updates the KI's <code>last_seen</code> timestamp rather than creating a duplicate.</p>
<div align="center">
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-knowledge-indicators-log-extraction/ki-detail.png" alt="KI detail panel for api_gateway → inference_service showing type, subtype, properties, confidence, evidence, and expiry date" /></p>
<p><em>KI detail panel showing type, subtype, properties, confidence, evidence, and expiry date for a service dependency.</em></p>
</div>
<h2>The Foundation for Intelligent Observability</h2>
<p>So what can we do with all of this? These KIs serve as the contextual foundation for Elastic's more advanced capabilities. From just these extracted KIs, we can automatically generate active Rules to surface interesting signals, without a human engineer writing a single line of configuration. More on this particular capability in the next post in this series.</p>
<div align="center">
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-knowledge-indicators-log-extraction/queries.png" alt="85 auto-generated Rules from KIs with impact ratings (Critical/High) and event occurrence sparklines" /></p>
<p><em>85 auto-generated Rules from 84 KIs, with impact ratings and event occurrence sparklines.</em></p>
</div>
<p>For a user or an agent, the dependency KIs automatically form an infrastructure graph that is inferred entirely from log data, not from distributed tracing or any manual configuration. During an incident, this graph is invaluable for assessing blast radius. If a specific database goes down, the topology map immediately shows you which upstream services are about to fail, without anyone maintaining a manual service catalog.</p>
<div align="center">
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-knowledge-indicators-log-extraction/topology-dependencies@2x.png" alt="Topology graph showing service dependencies: claim-intake connects to claim-intake-db (PostgreSQL), fraud-check, policy-lookup; fraud-check connects to fraud-check-db (MongoDB); policy-lookup connects to policy-lookup-db (PostgreSQL); payment-processor and kafka grouped separately; notification-dispatch and kubernetes at bottom" /></p>
<p><em>Service dependency graph extracted from KIs, showing services, databases, and infrastructure components.</em></p>
</div>
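<p>Assessing blast radius on such a graph amounts to a reverse traversal of the dependency edges. The sketch below assumes dependency KIs shaped like the earlier JSON example, with <code>source</code> and <code>target</code> in <code>properties</code>. The function name <code>blast_radius</code> and the sample edges are illustrative, not taken from the product.</p>

```python
from collections import defaultdict, deque


def blast_radius(dependency_kis, failed):
    """Return every component that depends, directly or transitively, on `failed`."""
    dependents = defaultdict(set)
    for ki in dependency_kis:
        props = ki["properties"]
        dependents[props["target"]].add(props["source"])

    affected = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in dependents.get(node, ()):
            if caller != failed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


edges = [
    ("claim-intake", "claim-intake-db"),
    ("claim-intake", "fraud-check"),
    ("claim-intake", "policy-lookup"),
    ("fraud-check", "fraud-check-db"),
    ("fraud-check", "policy-lookup"),
    ("policy-lookup", "policy-lookup-db"),
]
kis = [{"type": "dependency", "properties": {"source": s, "target": t}} for s, t in edges]
print(sorted(blast_radius(kis, "policy-lookup-db")))
```

<p>A breadth-first walk keeps the traversal safe on graphs with cycles, since each component is added to the affected set at most once.</p>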
<p>This context changes how an AI agent handles an incident. Instead of starting from scratch, the agent initiates its investigation using your system's actual topology and known failure modes. Based on the KIs, it identifies the relevant streams, runs the applicable queries, and formulates a specific hypothesis. In our example, it already knows that <code>api_gateway</code> relies on <code>inference_service</code>, and it knows that connection slot exhaustion is a high-severity failure mode for your Postgres instance.</p>
<p>This extracted knowledge doesn't have to be perfect to be useful. Because LLMs are inherently non-deterministic, a KI might occasionally be slightly off, but it still gives the agent a significant head start. The agent can cross-reference the KI against live logs and self-correct on the fly. The real benefit is simply not having to reconstruct basic facts during a critical outage. KIs also drive AI-generated dashboard suggestions and inform Grok pattern generation whenever you introduce new streams.</p>
<h2>Self-Cleaning and Scalable</h2>
<p>Maintaining this knowledge base is entirely hands-off. KIs auto-expire after 7 days if they aren't observed in subsequent extraction runs. If you decommission a service, its associated KIs simply fade away without any manual cleanup. If the service comes back online later, the KIs are re-extracted. Users can also mark individual feature KIs as false positives, and the system carries those exclusions forward into future runs to prevent re-identification.</p>
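<p>Expiry itself is a simple filter over the stored KIs. The sketch below is a hypothetical illustration, assuming each KI carries an ISO 8601 <code>expires_at</code> timestamp as in the earlier JSON example. The function names are invented.</p>

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, accepting a trailing Z for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def prune_expired(kis, now):
    """Keep only the KIs whose expires_at is still in the future."""
    return [ki for ki in kis if parse_timestamp(ki["expires_at"]) > now]


stored = [
    {"title": "api_gateway → inference_service", "expires_at": "2026-04-09T00:00:00Z"},
    {"title": "legacy_batch_job", "expires_at": "2026-04-01T00:00:00Z"},
]
now = datetime(2026, 4, 2, tzinfo=timezone.utc)
print([ki["title"] for ki in prune_expired(stored, now)])
```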
<p>Because we scoped KI extraction as a specific classification task, looking at around 20 log samples to identify services, infrastructure, and dependencies, it doesn't require a large frontier model to run. A fast, cost-effective model handles this without multi-step reasoning.</p>
<h2>You Shouldn't Have to Tell Your Tools What to Watch</h2>
<p>The fundamental promise of observability is to help you understand your systems. For far too long, the burden of teaching the tool how those systems actually work has fallen on the engineers operating them.</p>
<p>The next post in this series looks at what agents do with that context: why every agent that investigates your system without KIs re-learns the same things from scratch on every incident, and what changes when it doesn't have to.</p>
<p><strong>NOTE:</strong> These capabilities are available behind a feature flag in Serverless Observability projects. Turn on <code>observability:streamsEnableSignificantEvents</code> by searching for it in the Kibana advanced settings page.</p>
<h2>Frequently asked questions</h2>
<p><strong>What are Knowledge Indicators in Elasticsearch Streams?</strong>
Knowledge Indicators (KIs) are structured facts extracted from raw log streams: service names, infrastructure components, service-to-service dependencies, and tech stack details. Elastic extracts them automatically by sampling log lines, without requiring schemas, service catalogs, or manual configuration.</p>
<p><strong>How does Elastic build a service topology map from logs alone?</strong>
The extraction pipeline samples log lines and identifies dependency relationships, such as an outbound HTTP call from one service to another. These dependency KIs are used to construct a topology graph that shows which services depend on which, entirely inferred from log data, without distributed tracing or any manual input.</p>
<p><strong>Why does an AI agent need Knowledge Indicators before investigating an incident?</strong>
Without KIs, an AI agent starts every investigation from scratch: it has to read hundreds of log lines just to establish which services exist and how they relate. KIs give the agent a pre-built map of your system, including services, known failure modes, and relevant queries, so it can begin reasoning about the actual incident immediately.</p>
<p><strong>Do I need to configure anything for Knowledge Indicator extraction to work?</strong>
No. The pipeline requires no schema definitions, no service catalog, and no predefined rules. It samples a small set of log lines from a stream, analyzes them through a combination of LLM inference and deterministic generators, and accumulates findings automatically.</p>
<p><strong>How accurate are LLM-extracted KIs compared to computed ones?</strong>
Computed (deterministic) KIs always receive a confidence score of 100 because they are derived from statistical analysis rather than inference. LLM-extracted KIs receive scores from 0 to 100 based on the quality of evidence found in the sampled logs. Rules, agent investigations, and topology maps can all use this score to weight their decisions.</p>
<p><strong>What happens when a service is decommissioned?</strong>
KIs carry a 7-day expiration. If a service stops appearing in subsequent extraction runs, its KIs expire and are removed automatically. No manual cleanup required. If the service comes back, the KIs are re-extracted on the next run.</p>
<p><strong>How does this compare to service discovery via distributed tracing?</strong>
Distributed tracing requires instrumented services and a trace collector. Knowledge Indicator extraction requires nothing beyond existing log streams: no SDK, no agent, no schema. For environments with partial or no tracing coverage, KI extraction provides topology and dependency information that tracing would otherwise miss.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-knowledge-indicators-log-extraction/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Connecting Cursor to Production Logs via the Elastic MCP Server]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-mcp-server-cursor-production-logs</link>
            <guid isPermaLink="false">elastic-mcp-server-cursor-production-logs</guid>
            <pubDate>Wed, 29 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to connect Cursor to your Elastic APM data using the Elastic Agent Builder MCP server, so you can debug production errors and make UI decisions backed by real usage data without leaving your editor.]]></description>
            <content:encoded><![CDATA[<h2>Prerequisites</h2>
<ul>
<li>
<p>Elasticsearch 9.3+ (or Elastic Cloud Serverless)</p>
</li>
<li>
<p>Elasticsearch API key and Kibana URL</p>
</li>
<li>
<p>An application instrumented with Elastic APM: the <a href="https://www.elastic.co/guide/en/apm/agent/rum-js/current/index.html">RUM agent</a> for frontend interactions (populates <code>traces-apm-*</code>) and the <a href="https://www.elastic.co/docs/reference/apm-agents">APM agent</a> for backend errors (populates <code>logs-apm.error-*</code>)</p>
</li>
<li>
<p><a href="https://cursor.com/home">Cursor</a> (version 2.6+) installed</p>
</li>
</ul>
<h2>The problem with two worlds</h2>
<p>Application logs and code are two separate worlds that don't talk to each other. To apply an insight from the logs to the application, you have to analyze the logs in one tool, then switch back to the editor and apply your findings by hand.</p>
<p>The <a href="https://modelcontextprotocol.io/">Model Context Protocol (MCP)</a> changes this. MCP is an open standard that lets AI clients like Cursor connect to external tools and data sources through a standardized interface. Instead of your IDE only knowing about your local code, it can also talk to your Elasticsearch cluster, query your APM data, and reason about production behavior alongside your source files.</p>
<p>Elastic ships a <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">built-in MCP server</a> as part of <a href="https://www.elastic.co/docs/explore-analyze/ai-features/elastic-agent-builder">Agent Builder</a>. You define tools in Kibana, expose them via the MCP endpoint, and any MCP-compatible client can call them. Cursor supports MCP natively, which means you can set this up in minutes.</p>
<h2>What we're building</h2>
<p>We're working with an eCommerce search app instrumented with Elastic APM. The RUM JS agent tracks filter click interactions from the browser, stored in <code>traces-apm-default</code>. The Node.js APM agent captures backend errors, stored in <code>logs-apm.error-default</code>.</p>
<p>Two situations come up during development:</p>
<ul>
<li>
<p><strong>Use case 1</strong>: The product team wants to simplify the search page. There are six filters but we don't know which ones users actually click. We need usage data to decide which to keep.</p>
</li>
<li>
<p><strong>Use case 2</strong>: Users report intermittent 500 errors on search. The errors are not constant and started two days ago. We need the error details to find the root cause.</p>
</li>
</ul>
<p>To bring that data into Cursor, we'll build two Agent Builder tools in Kibana and connect them via the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic Agent Builder MCP Server</a>:</p>
<ul>
<li>
<p><code>get_filter_usage</code>: queries <code>traces-apm-default</code> for filter click events and returns a usage breakdown by filter name</p>
</li>
<li>
<p><code>get_recent_errors</code>: queries <code>logs-apm.error-default</code> for the most recent error groups for a given service, including the exception message and stack trace culprit</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/architecture.png" alt="Architecture diagram showing Cursor connecting to the Elastic Agent Builder MCP server, which queries Elasticsearch APM data" /></p>
<p>For a deeper look at the overall architecture, see the <a href="https://www.elastic.co/search-labs/blog/agent-builder-mcp-reference-architecture-elasticsearch">Agent Builder reference guide</a>.</p>
<h2>Setting up the Elastic MCP Server</h2>
<h3>Step 1: Create the Agent Builder tools</h3>
<p>We create both tools via the <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/kibana-api">Kibana Agent Builder API</a>. Each tool is an ES|QL query with a name and description that Cursor uses to decide when to call it. The full implementation of both tools is available in this <a href="https://github.com/elastic/elasticsearch-labs/blob/main/supporting-blog-content/cursor-production-logs-elastic-mcp-server/notebook.ipynb"><code>notebook</code></a>.</p>
<h4>Tool 1: get_filter_usage</h4>
<p>The product team needs to know which filters users actually click before deciding which ones to remove. The query reads RUM interaction events from <code>traces-apm-default</code> and groups them by filter name:</p>
<pre><code>    {
      &quot;id&quot;: &quot;get_filter_usage&quot;,
      &quot;type&quot;: &quot;esql&quot;,
      &quot;description&quot;: &quot;Returns the usage count for each search filter in the ecommerce-search-ui service, sorted by most used first.&quot;,
      &quot;configuration&quot;: {
        &quot;query&quot;: &quot;FROM traces-apm-default | WHERE service.name == \&quot;ecommerce-search-ui\&quot; | WHERE transaction.type == \&quot;user-interaction\&quot; | WHERE labels.filter_name IS NOT NULL | STATS count = COUNT(*) BY labels.filter_name | SORT count DESC&quot;
      }
    }
</code></pre>
<h4>Tool 2: get_recent_errors</h4>
<p>For the error debugging use case, we need to surface the most frequent recent errors for a service, along with where in the code they originate. <code>STATS ... BY</code> groups errors by their fingerprint (<code>grouping_key</code>), surfaces the exception message and the line of code that caused it (<code>culprit</code>), and ranks by frequency:</p>
<pre><code>    {
      &quot;id&quot;: &quot;get_recent_errors&quot;,
      &quot;type&quot;: &quot;esql&quot;,
      &quot;description&quot;: &quot;Returns the most frequent error groups for ecommerce-search-ui, ranked by occurrence count, with the exception message and code location.&quot;,
      &quot;configuration&quot;: {
        &quot;query&quot;: &quot;FROM logs-apm.error-default | WHERE service.name == \&quot;ecommerce-search-ui\&quot; | WHERE processor.name == \&quot;error\&quot; | STATS count = COUNT(*) BY error.grouping_key, error.exception.0.message, error.culprit | SORT count DESC | LIMIT 5&quot;
      }
    }
</code></pre>
<p>Both tools are created with <code>POST /api/agent_builder/tools</code>. You can learn more about the Kibana API endpoints for Elastic Agent Builder <a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/kibana-api">here</a>.</p>
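<p>As a rough sketch of that call, the Python below builds the request with the standard library only. The endpoint and the <code>ApiKey</code> authorization scheme come from this post. The helper name <code>build_create_tool_request</code>, the example host, and the inclusion of the <code>kbn-xsrf</code> header are assumptions, so check them against the linked notebook and API documentation before relying on them.</p>

```python
import json
import urllib.request

GET_FILTER_USAGE = {
    "id": "get_filter_usage",
    "type": "esql",
    "description": (
        "Returns the usage count for each search filter in the "
        "ecommerce-search-ui service, sorted by most used first."
    ),
    "configuration": {
        "query": (
            "FROM traces-apm-default "
            '| WHERE service.name == "ecommerce-search-ui" '
            '| WHERE transaction.type == "user-interaction" '
            "| WHERE labels.filter_name IS NOT NULL "
            "| STATS count = COUNT(*) BY labels.filter_name "
            "| SORT count DESC"
        )
    },
}


def build_create_tool_request(kibana_url, api_key, tool):
    """Build (but do not send) the request that creates one Agent Builder tool."""
    return urllib.request.Request(
        url=kibana_url.rstrip("/") + "/api/agent_builder/tools",
        data=json.dumps(tool).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": "ApiKey " + api_key,
            "Content-Type": "application/json",
            # Assumption: Kibana generally expects this header on write requests.
            "kbn-xsrf": "true",
        },
    )


# Hypothetical host and key, for illustration only.
request = build_create_tool_request("https://kibana.example.com", "YOUR_API_KEY", GET_FILTER_USAGE)
print(request.get_method(), request.full_url)
```

<p>Passing the result to <code>urllib.request.urlopen</code> would send it. Building the request separately makes it easy to inspect the payload first.</p>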
<h3>Step 2: Connect to Cursor</h3>
<p>Open <code>~/.cursor/mcp.json</code> and add the Elastic server. For detailed information, see the Cursor <a href="https://cursor.com/docs/mcp#using-mcpjson">documentation</a>. The Agent Builder MCP endpoint uses Server-Sent Events (SSE) transport, so we connect via <code>mcp-remote</code>, a lightweight bridge that Cursor invokes as a local process:</p>
<pre><code>    {
      &quot;mcpServers&quot;: {
        &quot;elastic-agent-builder&quot;: {
          &quot;command&quot;: &quot;npx&quot;,
          &quot;args&quot;: [
            &quot;-y&quot;,
            &quot;mcp-remote&quot;,
            &quot;https://YOUR_KIBANA_URL/api/agent_builder/mcp&quot;,
            &quot;--header&quot;,
            &quot;Authorization: ApiKey YOUR_API_KEY&quot;
          ]
        }
      }
    }
</code></pre>
<p>Replace <code>YOUR_KIBANA_URL</code> and <code>YOUR_API_KEY</code> with your values.</p>
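<p>If you prefer to generate this file rather than edit it by hand, a small Python helper can render the same structure. This is an optional sketch: the function name <code>build_mcp_config</code> and the example host are hypothetical, and the output simply mirrors the configuration shown above.</p>

```python
import json


def build_mcp_config(kibana_url, api_key, server_name="elastic-agent-builder"):
    """Return the mcp.json structure for the Agent Builder MCP endpoint."""
    return {
        "mcpServers": {
            server_name: {
                "command": "npx",
                "args": [
                    "-y",
                    "mcp-remote",
                    kibana_url.rstrip("/") + "/api/agent_builder/mcp",
                    "--header",
                    "Authorization: ApiKey " + api_key,
                ],
            }
        }
    }


# Hypothetical values, for illustration only.
config = build_mcp_config("https://kibana.example.com", "YOUR_API_KEY")
print(json.dumps(config, indent=2))
```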
<p>Restart Cursor, open the Agent panel, and confirm that <code>get_filter_usage</code> and <code>get_recent_errors</code> appear in the available tools list. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/cursor-mcp-tools.png" alt="Cursor MCP panel showing the get_filter_usage and get_recent_errors tools available from the Elastic Agent Builder server" /></p>
<h2>Use case 1: Data-driven UI optimization</h2>
<p>The eCommerce search page has six filters: category, manufacturer, price range, customer gender, day of week, and region. The product team wants to simplify the UI by removing filters that users rarely use. Rather than guessing, we ask Cursor to check.</p>
<p>When you type a prompt in Cursor's Agent panel, the model sees the name and description of every connected MCP tool. It matches your intent to the best-fitting tool and calls it automatically. This is why the <code>description</code> field we set in Step 1 matters: it's what the model reads to decide which tool answers your question. If you are interested in learning more about Cursor’s MCP tools management, read the following <a href="https://cursor.com/docs/mcp#using-mcp-in-chat">documentation</a>.</p>
<p>Open a Cursor chat and ask: &quot;Show me how often each search filter is used.&quot; Cursor calls the tool and returns something like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/filter-usage-chart.png" alt="Filter usage breakdown returned by the get_filter_usage tool" /></p>
<p>The category and manufacturer filters get most of the clicks. The bottom three filters (<code>customer_gender</code>, <code>day_of_week</code>, <code>region</code>) are rarely used.</p>
<p>Ask Cursor to act on this: <strong><em>&quot;Based on this data, simplify the SearchFilters component. Keep the top 3 filters visible, collapse the others under a 'More filters' toggle.&quot;</em></strong></p>
<p>Cursor opens <code>src/components/SearchFilters.jsx</code>, reads the current implementation, and proposes the change.</p>
<p>Before: </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/search-filters-before.png" alt="SearchFilters component before the change, showing all six filters" /></p>
<p>After: </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/search-filters-after.png" alt="SearchFilters component after the change, showing the top three filters with the rest collapsed under a More filters toggle" /></p>
<p>The entire loop took one chat prompt. The decision was backed by production data, not a team discussion about what users probably care about.</p>
<h2>Use case 2: Production error debugging</h2>
<p>A bug report comes in: intermittent 500 errors on the search endpoint. The errors started appearing two days ago but they're not constant. The developer opens Cursor and asks: &quot;Show me what errors ecommerce-search-ui is throwing.&quot;</p>
<p>Cursor calls the tool and returns the error groups:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/recent-errors.png" alt="Most recent error groups returned by the get_recent_errors tool" /></p>
<p>The error message is explicit: <code>category</code> is a text field and can't be used in a terms aggregation. The correct field is <code>category.keyword</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/error-fix-diff.png" alt="Cursor proposing the fix that changes category to category.keyword in the ES|QL query" /></p>
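<p>The application's actual query is not shown in this post, so the following Python sketch uses a hypothetical request body to illustrate the difference. The aggregation name <code>by_category</code> and the helper name are invented. The only point is which field the terms aggregation targets.</p>

```python
def category_terms_agg(field):
    """Build a search body with a terms aggregation on the given field."""
    return {
        "size": 0,
        "aggs": {"by_category": {"terms": {"field": field}}},
    }


# Text fields are analyzed and, by default, cannot back a terms aggregation.
broken = category_terms_agg("category")
# The keyword sub-field stores the exact value and can be aggregated.
fixed = category_terms_agg("category.keyword")
print(fixed["aggs"]["by_category"]["terms"]["field"])
```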
<p>With APM data available alongside your code, the debugging session becomes a conversation: you describe the symptom, the agent pulls the relevant logs, and you work through what's happening together. You can ask follow-up questions, check whether the error correlates with a recent deployment, or ask which endpoints are most affected, all within the same context where you'll make the fix. If you want to go further, Elastic also provides <a href="https://www.elastic.co/docs/solutions/observability/ai/agent-builder-observability">pre-built observability tools in Agent Builder</a> that you can use alongside custom tools like the ones we created here. For a complementary approach to AI-driven observability, see <a href="https://www.elastic.co/observability-labs/blog/ai-observability-web-agents-openlit">how to monitor web AI agents with OpenLIT and Elastic</a>.</p>
<h2>Conclusion</h2>
<p>What we covered:</p>
<ul>
<li>
<p>How to create Agent Builder tools in Kibana that wrap APM data queries</p>
</li>
<li>
<p>How to connect the Elastic Agent Builder MCP Server to Cursor with a few lines of JSON</p>
</li>
<li>
<p>How to use production telemetry to make a UI decision backed by real usage data</p>
</li>
<li>
<p>How to debug a production error from the same window where you fix it</p>
</li>
</ul>
<p>These two use cases are a starting point. The same pattern works for any data you have in Elasticsearch: performance metrics, A/B test results, audit logs, feature flag usage, user session data. Define the Agent Builder tool, connect it via MCP, and it becomes part of your development context in Cursor. For other examples of what's possible, see <a href="https://www.elastic.co/observability-labs/blog/mcp-elastic-synthetics">automating synthetic monitoring with MCP</a> and <a href="https://www.elastic.co/observability-labs/blog/agentic-cicd-kubernetes-mcp-server">agentic CI/CD deployment gates</a>.</p>
<h2>Next steps</h2>
<ul>
<li>
<p><a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/mcp-server">Elastic Agent Builder MCP server documentation</a></p>
</li>
<li>
<p><a href="https://modelcontextprotocol.io/">Model Context Protocol specification</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/docs/explore-analyze/ai-features/agent-builder/">Elastic Agent Builder overview</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-mcp-server-cursor-production-logs/header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Introducing Streams for Observability: Your first stop for investigations]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations</link>
            <guid isPermaLink="false">elastic-observability-streams-ai-logs-investigations</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Introducing Elastic Streams, a new AI observability feature that transforms logs from a noisy and expensive data source into a primary investigation signal.]]></description>
            <content:encoded><![CDATA[<p>We're excited to introduce Streams, a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips this script by transforming raw logs into your most valuable asset to immediately identify not only the root cause, but also the why behind the root cause to enable instant resolution.</p>
<p>SREs today identify the &quot;what&quot; with metrics and the &quot;where&quot; with traces, which are important for troubleshooting. However, it's often the &quot;why&quot; that's needed for faster and more accurate incident resolution. The crucial “why” is buried in your logs, but the massive volume and unstructured nature of logs in modern microservice environments have made them difficult to use effectively. This has forced teams into a difficult position: either spending countless hours building and maintaining complex data pipelines to tame the chaos, or dropping valuable log data to control costs and risking critical visibility gaps. As a result, when an incident occurs, SREs waste precious time manually hunting for clues and reverse-engineering data instead of quickly resolving the issue.</p>
<h2>Streams, from ingest to answers with logs</h2>
<p>Streams directly addresses this challenge by using AI to transform the chaos of raw logs into your clearest path to a solution, enabling logs to be the primary signal for investigations. It processes raw logs at scale ingested from any source and in any format (structured and unstructured), then partitions, parses, and helps manage retention and data quality. Streams reduces the need for SREs to constantly normalize data, manage custom schemas, or sift through endless noise. Streams also surfaces Significant Events, like major errors and anomalies, enabling you to be proactive in your investigations. SREs can now focus on resolving issues faster than ever by spending less time on data management and hunting through the noise.</p>
<p>Let's see Streams in action. In the demo below, watch an SRE tackle an issue with a critical trading application in production. In minutes, Streams processes the raw logs, pinpoints a Java out-of-memory error, and the AI Assistant guides the SRE straight to the root cause, turning hours of manual work into a quick fix.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/streams-in-action.gif" alt="Streams in Action" /></p>
<p>Let's walk through some of the key Streams capabilities highlighted in the video:</p>
<ul>
<li><strong>AI-based partitioning</strong> - simplifies ingest by allowing SREs to send all logs to a single endpoint, without worrying about agents or integrations. Our AI automatically determines that logs are coming from two different systems, Hadoop and Spark. As more data comes through, it continues to learn and identify additional components, making segmentation effortless.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-01.png" alt="AI-based partitioning" /></p>
<ul>
<li><strong>AI-based parsing</strong> - eliminates the manual effort of building and managing log processing pipelines. In the demo, Streams automatically detects logs from Spark and generates a GROK rule that parses 100% of the fields.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-02.png" alt="AI-based parsing" /></p>
<ul>
<li><strong>Identifying Significant Events</strong> - cuts through the noise so you can focus immediately on key issues. Streams analyzes the parsed Spark logs and pinpoints the Java out-of-memory errors and exceptions. This provides SREs with a clear, actionable starting point for their investigations instead of forcing them to hunt through raw data.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-03.png" alt="Significant Events" /></p>
<ul>
<li><strong>AI Assistant</strong> - The AI Assistant provides instant root cause analysis, turning hours of work into immediate answers. After Streams identifies the Java OOM error, an SRE can analyze logs in Discover with the AI Assistant. Within moments, it determines the root cause is that Spark lacks sufficient memory for the datasets being processed, delivering a precise answer to guide remediation.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-04.png" alt="AI Assistant" /></p>
<p>One item that isn't in the video is how easy Streams makes log ingestion. In the example above, we used the OTel Collector and configured it with just a processor, an exporter, and a service pipeline in the <code>values.yaml</code> file for the OTel Collector's Helm chart:</p>
<pre><code class="language-yaml">processors:
  # Route every log record to the "logs" stream by setting the index attribute.
  transform/logs-streams:
    log_statements:
      - context: resource
        statements:
          - set(attributes[&quot;elasticsearch.index&quot;], &quot;logs&quot;)
exporters:
  debug:
  otlp/ingest:
    endpoint: ${env:ELASTIC_OTLP_ENDPOINT}
    headers:
      Authorization: ApiKey ${env:ELASTIC_API_KEY}

service:
  pipelines:
    logs:
      # The filelog receiver and batch processor are assumed to be defined elsewhere in values.yaml.
      receivers: [filelog]
      processors: [batch, transform/logs-streams]
      # Reference the exporters defined above.
      exporters: [otlp/ingest, debug]
</code></pre>
<p>With Streams you can use any log forwarder: the OTel Collector (as in the example above), Fluentd, Fluent Bit, and so on. This makes ingesting simple and ensures you aren't locked into any specific log forwarder for Elastic.</p>
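<p>The pipeline in the snippet above references a <code>filelog</code> receiver and a <code>batch</code> processor that the snippet itself does not define. A minimal sketch of what those definitions might look like follows. The include path is a hypothetical example, and in a Helm deployment these components may already be supplied by the chart rather than written by hand:</p>
<pre><code class="language-yaml">receivers:
  filelog:
    # Hypothetical path; point this at the log files you want to tail.
    include:
      - /var/log/myapp/*.log
    # Read existing file content on first start instead of only new lines.
    start_at: beginning
processors:
  # Default batching settings; tune only if needed.
  batch: {}
</code></pre>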
<p>As you've seen in this example, Streams helps SREs focus on finding the “why”, without the manual, error-prone work of making logs usable. What used to take hours can now be accomplished in minutes.</p>
<h2>Streams: Key Features and availability</h2>
<p>While the previous example shows how easy and fast it is to get to RCA with partitioning, parsing, Significant Events, and the AI Assistant, Streams has more capabilities, which are highlighted in the following diagram:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/image-05.png" alt="Streams" /></p>
<p>All of these capabilities are available in two primary modes: Streams for data already indexed in Elasticsearch, and Logs Streams for ingesting raw logs directly. Both modes support AI-driven partitioning and parsing, the identification of Significant Events, and essential tools for managing data quality, retention, and cost-efficient storage.</p>
<p><strong>Streams (GA in 9.2)</strong></p>
<p>Provides foundational capabilities that reduce pipeline management for SREs. Streams works with logs from existing agents and integrations as well as raw, unstructured logs coming through Logs Streams. Key capabilities include:</p>
<ul>
<li>
<p>Streams Processing: simulate and refine log parsing using AI-powered Parsing or a point-and-click UI. Compare before-and-after states and modify schemas to simplify log processing.</p>
</li>
<li>
<p>Streams Retention Management: define time-based or advanced ILM policies directly in the UI, gain visibility into ingestion volume, and manage data in the failure store.</p>
</li>
<li>
<p>Streams Data Quality: detect and fix ingestion failures via a failure store that captures and exposes failed documents for inspection.</p>
</li>
</ul>
<p><strong>Logs Streams (Tech Preview)</strong></p>
<p>Enables SREs to ingest any log, in any format, directly into Elasticsearch, without the need for agents or integrations. Key capabilities include:</p>
<ul>
<li>
<p>Direct Ingestion with any log forwarder into Elasticsearch: send raw logs directly into the /logs index using any mechanism, such as the logs_index parameter in an OpenTelemetry Collector.</p>
</li>
<li>
<p>AI-Driven Partitioning: automatically or manually segment a single log stream into distinct parts (e.g., by service or component) using contextual AI-based suggestions.</p>
</li>
</ul>
<p><strong>Significant Events (Tech Preview)</strong></p>
<p>Significant Events is available in both Streams and Logs Streams, and surfaces errors and anomalies that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other signals of change. These events act as actionable markers, giving SREs early warning and an investigative starting point before a service impact occurs.</p>
<h2>What does this mean for SREs in practice?</h2>
<p>With Elastic Streams, SREs no longer need to spend time data wrangling before they can be investigators. Logs are the primary investigation signal because Streams provides SREs with the ability to:</p>
<ul>
<li><strong>Log everything in any format, and don't worry about pipelines</strong> - Stop wasting time building and maintaining complex ingestion pipelines. Send logs in any format, structured or unstructured, from any source directly to a single Elastic endpoint, without needing specific agents. Use OTel Collectors or any other data shipper to send logs to Elastic. Streams' AI-driven processing parses and structures your log data, making it immediately “ready for investigation”. This means you can adapt to new log formats on the fly without maintaining brittle configurations. Streams ensures you always have the data you need, the moment you need it.</li>
</ul>
<ul>
<li><strong>Don't just collect logs, get answers from them</strong> - Streams analyzes your data to surface “Significant Events,” proactively identifying critical errors, anomalies, and performance bottlenecks like out-of-memory exceptions. Instead of manually sifting through terabytes of data, you get a clear, prioritized starting point for your investigation. This allows you to go from symptom to solution in minutes, fixing issues before they impact users.</li>
</ul>
<ul>
<li><strong>Achieve complete visibility at a lower cost</strong> - Get comprehensive visibility across all your services without the expected expense. By intelligently structuring data and surfacing only the most critical events, Streams reduces operational complexity and dramatically cuts down root cause analysis time. This efficiency allows you to store all relevant log data cost-effectively, ensuring you never have to sacrifice crucial visibility to meet a budget. Get clearer answers faster and lower your total cost of ownership.</li>
</ul>
<h2>Conclusion</h2>
<p>Elastic Streams revolutionizes observability by transforming logs from a noisy and expensive data source into a primary investigation signal. Through AI-powered capabilities like automatic partitioning, parsing, retention management, and the surfacing of Significant Events, Streams empowers SREs to move beyond data management and directly pinpoint the root cause of issues. By reducing operational complexity, lowering storage costs, and providing complete visibility, Streams ensures that logs, enriched by AI, become the fastest path to resolution by answering the critical question “why” for observability.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a>, and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>Additionally, check out:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-observability-streams-ai-logs-investigations/streams-launch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Elastic SQL inputs: A generic solution for database metrics observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/sql-inputs-database-metrics-observability</link>
            <guid isPermaLink="false">sql-inputs-database-metrics-observability</guid>
            <pubDate>Mon, 11 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog dives into the functionality of generic SQL and provides various use cases for advanced users to ingest custom metrics to Elastic for database observability. We also introduce the new fetch_from_all_databases capability released in 8.10.]]></description>
            <content:encoded><![CDATA[<p>Elastic<sup>®</sup> SQL inputs (<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">metricbeat</a> module and <a href="https://docs.elastic.co/integrations/sql">input package</a>) allow the user to execute <a href="https://en.wikipedia.org/wiki/SQL">SQL</a> queries against many supported databases in a flexible way and ingest the resulting metrics into Elasticsearch<sup>®</sup>. This blog dives into the functionality of generic SQL and provides various use cases for <em>advanced users</em> to ingest custom metrics to Elastic<sup>®</sup> for database observability. The blog also introduces the new fetch_from_all_databases capability, released in 8.10.</p>
<h2>Why “Generic SQL”?</h2>
<p>Elastic already has metricbeat and integration packages targeted for specific databases. One example is <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-mysql.html">metricbeat</a> for MySQL — and the corresponding integration <a href="https://docs.elastic.co/en/integrations/mysql">package</a>. These beats modules and integrations are customized for a specific database, and the metrics are extracted using pre-defined queries from the specific database. The queries used in these integrations and the corresponding metrics are <em>not</em> available for modification.</p>
<p>In contrast, the <em>Generic SQL inputs</em> (<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">metricbeat</a> or <a href="https://docs.elastic.co/integrations/sql">input package</a>) can scrape metrics from any supported database using SQL queries that the user provides for the specific metrics to be extracted. This is a much more powerful mechanism for metrics ingestion: users choose a driver and supply the relevant SQL queries, and the results are mapped to one or more Elasticsearch documents using a structured mapping process (the table and variables formats explained later).</p>
<p>Generic SQL inputs can be used in conjunction with the existing integration packages, which already extract specific database metrics, to extract additional custom metrics dynamically, making this input very powerful. In this blog, <em>Generic SQL input</em> and <em>Generic SQL</em> are used interchangeably.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-1-genericSQL.png" alt="Generic SQL database metrics collection" /></p>
<h2>Functionality details</h2>
<p>This section covers some of the features that would help with the metrics extraction. We provide a brief description of the response format configuration. Then we dive into the merge_results functionality, which is used to combine results from multiple SQL queries into a single document.</p>
<p>The next key functionality users may be interested in is to collect metrics from all the custom databases, which is now possible with the fetch_from_all_databases feature.</p>
<p>Now let's dive into the specific functionalities:</p>
<h3>Different drivers supported</h3>
<p>Generic SQL can fetch metrics from different databases. The current version supports the following drivers: MySQL, PostgreSQL, Oracle, and Microsoft SQL Server (MSSQL).</p>
<h3>Response format</h3>
<p>The response format in generic SQL controls whether query results are mapped in table or variables format. Here’s an overview of the two formats and their syntax.</p>
<p>Syntax: <code>response_format: table</code> or <code>response_format: variables</code></p>
<p><strong>Response format table</strong><br />
This mode generates a single event for each row. The table format places no restriction on the number of columns in the response.</p>
<p>Example:</p>
<pre><code class="language-yaml">driver: &quot;mssql&quot;
sql_queries:
 - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
   response_format: table
</code></pre>
<p>This query returns a response similar to this:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;counter_name&quot;:&quot;User Connections &quot;,
         &quot;cntr_value&quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>The response generated above adds the counter_name as a key in the document.</p>
<p><strong>Response format variables</strong><br />
The variables format supports key:value pairs. It expects the query to return exactly two columns: the first supplies the key and the second supplies the value.</p>
<p>Example:</p>
<pre><code class="language-yaml">driver: &quot;mssql&quot;
sql_queries:
 - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
   response_format: variables
</code></pre>
<p>The variables format takes the value of the first column in the query above as the key:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user connections &quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>In the above response, you can see the value of counter_name is used to generate the key in variable format.</p>
<h3>Response optimization: merge_results</h3>
<p>We now support merging multiple query responses into a single event. By enabling <strong>merge_results</strong>, users can significantly reduce the storage space used by the metrics ingested into Elasticsearch. This mode compacts the generated documents: instead of generating multiple documents, a single merged document is generated wherever applicable. Metrics of a similar kind, generated from multiple queries, are combined into a single event.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-2-output-merge-results.png" alt="Output of Merge results" /></p>
<p>Syntax: <code>merge_results: true</code> or <code>merge_results: false</code></p>
<p>The example below shows how the data is loaded into Elasticsearch when merge_results is disabled.</p>
<p>Example:</p>
<p>In this example, we are using two different queries to fetch metrics from the performance counter.</p>
<pre><code class="language-yaml">merge_results: false
driver: &quot;mssql&quot;
sql_queries:
  - query: &quot;SELECT cntr_value As 'user_connections' FROM sys.dm_os_performance_counters WHERE counter_name= 'User Connections'&quot;
    response_format: table
  - query: &quot;SELECT cntr_value As 'buffer_cache_hit_ratio' FROM sys.dm_os_performance_counters WHERE counter_name = 'Buffer cache hit ratio' AND object_name like '%Buffer Manager%'&quot;
    response_format: table
</code></pre>
<p>As you can see, the response for the above example generates a single document for each query.</p>
<p>The resulting document from the first query:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user_connections&quot;:7
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>And the resulting document from the second query:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;buffer_cache_hit_ratio&quot;:87
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p>When we enable the merge_results flag, both of the above metrics are combined and the data is loaded into a single document.</p>
<p>You can see the merged document in the example below:</p>
<pre><code class="language-json">&quot;sql&quot;:{
      &quot;metrics&quot;:{
         &quot;user_connections&quot;:7,
         &quot;buffer_cache_hit_ratio&quot;:87
      },
      &quot;driver&quot;:&quot;mssql&quot;
}
</code></pre>
<p><em>However, table queries can be merged only if each of them produces a single row. There is no such restriction on merging variables queries.</em></p>
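<p>As a rough illustration of the rule above, the following sketch enables merging for one variables query and one single-row table query. The queries reuse the performance counters from the earlier examples; treat it as a sketch under those assumptions rather than a tested configuration:</p>
<pre><code class="language-yaml">merge_results: true
driver: &quot;mssql&quot;
sql_queries:
  # Variables query: two columns (key, value); mergeable without restriction.
  - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name = 'User Connections'&quot;
    response_format: variables
  # Table query: mergeable because the filter is expected to yield a single row.
  - query: &quot;SELECT cntr_value AS 'buffer_cache_hit_ratio' FROM sys.dm_os_performance_counters WHERE counter_name = 'Buffer cache hit ratio' AND object_name LIKE '%Buffer Manager%'&quot;
    response_format: table
</code></pre>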
<h3>Introducing a new capability: fetch_from_all_databases</h3>
<p>This is a <a href="https://github.com/elastic/beats/pull/35688">new functionality</a> to fetch all the database metrics automatically from the system and user databases of the Microsoft SQL Server, by enabling the fetch_from_all_databases flag.</p>
<p>Keep an eye out for the <a href="https://www.elastic.co/guide/en/beats/metricbeat/8.10/metricbeat-module-sql.html#_example_execute_given_queries_for_all_databases_present_in_a_server">8.10 release version</a>, where you can start using the fetch_from_all_databases feature. Prior to 8.10, users had to provide the database names manually to fetch metrics from custom/user databases.</p>
<p>Syntax: <code>fetch_from_all_databases: true</code> or <code>fetch_from_all_databases: false</code></p>
<p>Below is a sample query with the fetch_from_all_databases flag disabled:</p>
<pre><code class="language-yaml">fetch_from_all_databases: false
driver: &quot;mssql&quot;
sql_queries:
  - query: &quot;SELECT @@servername AS server_name, @@servicename AS instance_name, name As 'database_name', database_id FROM sys.databases WHERE name='master';&quot;
</code></pre>
<p>The above query fetches metrics only for the named database. Here the input database is master, so metrics are fetched only for master.</p>
<p>Below is a sample query with the fetch_from_all_databases flag enabled:</p>
<pre><code class="language-yaml">fetch_from_all_databases: true
driver: &quot;mssql&quot;
sql_queries:
  - query: &quot;SELECT @@servername AS server_name, @@servicename AS instance_name, DB_NAME() AS 'database_name', DB_ID() AS database_id;&quot;
    response_format: table
</code></pre>
<p>The above query fetches metrics from all available databases. This is useful when the user wants to get data from all the databases.</p>
<p>Please note: currently this feature is supported only for Microsoft SQL Server and will be used by MS SQL integration internally, to support extracting metrics for <a href="https://github.com/elastic/integrations/issues/4108">all user DBs</a> by default.</p>
<h2>Using generic SQL: Metricbeat</h2>
<p>The generic <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">SQL metricbeat module</a> provides flexibility to execute queries against different database drivers. The metricbeat input is generally available (GA) for production use. <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">Here</a>, you can find more information on configuring <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-sql.html">the generic SQL</a> for different drivers with various examples.</p>
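<p>For orientation, a module configuration might look roughly like the sketch below. The host, credentials, and collection period are placeholder assumptions, and the query is borrowed from the earlier examples; check the module documentation linked above for the exact connection-string format of your driver:</p>
<pre><code class="language-yaml">- module: sql
  metricsets:
    - query
  # Placeholder collection interval.
  period: 10s
  # Placeholder connection string; replace the user, password, and host.
  hosts: [&quot;sqlserver://myuser:mypassword@localhost&quot;]
  driver: &quot;mssql&quot;
  sql_queries:
    - query: &quot;SELECT counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name = 'User Connections'&quot;
      response_format: variables
</code></pre>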
<h2>Using generic SQL: Input package</h2>
<p>The input package provides a flexible solution to advanced users for customizing their ingestion experience in Elastic. Generic SQL is now also available as an SQL <a href="https://docs.elastic.co/integrations/sql">input package</a>. The input package is currently available for early users as a <strong>beta release</strong>. Let's walk through how users can use generic SQL via the input package.</p>
<h3>Configurations of generic SQL input package:</h3>
<p>The configuration options for the generic SQL input package are as follows:</p>
<ul>
<li><strong>Driver:</strong> This is the SQL database for which you want to use the package. In this case, we will take mysql as an example.</li>
<li><strong>Hosts:</strong> Here the user enters the connection string to connect to the database. It would vary depending on which database/driver is being used. Refer <a href="https://docs.elastic.co/integrations/sql#hosts">here</a> for examples.</li>
<li><strong>SQL Queries:</strong> Here the user writes the SQL queries they want to run and specifies the response_format for each.</li>
<li><strong>Data set:</strong> The user specifies a <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#_data_stream_field_details">data set</a> name to which the response fields get mapped.</li>
<li><strong>Merge results:</strong> This is an advanced setting, used to merge queries into a single event.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-3-SQL-metrics-inputpackage.png" alt="Configuration parameters for SQL input package" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-4-expanded-document.png" alt="Metrics getting mapped to the index created by the ‘sql_first_dataset’" /></p>
<h3>Metrics extensibility with customized SQL queries</h3>
<p>Let's say a user is using the <a href="https://docs.elastic.co/integrations/mysql">MySQL Integration</a>, which provides a fixed set of metrics. Their requirement now extends to retrieving more metrics from the MySQL database by running new customized SQL queries.</p>
<p>This can be achieved by adding an instance of the SQL input package, writing the customized queries, and specifying a new <a href="https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset">data set</a> name as shown in the screenshot below.</p>
<p>This way users can get any metrics by executing corresponding queries. The resultant metrics of the query will be indexed to the new data set, sql_second_dataset.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-5-driver.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>When there are multiple queries, users can combine them into a single event by enabling the Merge Results toggle.</p>
<h3>Customizing user experience</h3>
<p>Users can customize their data by writing their own <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html">ingest pipelines</a> and providing their customized <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mappings</a>. Users can also build their own bespoke dashboards.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/elastic-blog-6-ingest-pipeline.png" alt="Customization of Ingest Pipelines and Mappings" /></p>
<p>As we can see above, the SQL input package provides the flexibility to get new metrics by running new queries, which is not supported in the default MySQL integration (where the user gets metrics from a predetermined set of queries).</p>
<p>The SQL input package also supports multiple drivers: mssql, postgresql and oracle. So a single input package can be used to cater to all these databases.</p>
<p>Note: The fetch_from_all_databases feature is not supported in the SQL input package yet.</p>
<h2>Try it out!</h2>
<p>Now that you know about various use cases and features of generic SQL, get started with <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> and try using the <a href="https://docs.elastic.co/integrations/sql">SQL input package</a> for your SQL database to get a customized experience and metrics. If you are looking for newer metrics for some of our existing SQL-based integrations (like <a href="https://docs.elastic.co/en/integrations/microsoft_sqlserver">Microsoft SQL Server</a>, <a href="https://docs.elastic.co/integrations/oracle">Oracle</a>, and more), go ahead and give the SQL input package a whirl.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/sql-inputs-database-metrics-observability/patterns-midnight-background-no-logo-observability.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Smarter Alerting Arrives with Faster Triage, Clearer Groupings, and Actionable Guidance]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-stack-observability-alerting-upgrade</link>
            <guid isPermaLink="false">elastic-stack-observability-alerting-upgrade</guid>
            <pubDate>Thu, 04 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring the latest enhancements in Elastic Stack alerting, including improved related alert grouping, linking dashboards to alert rules, and embedding investigation guides into alerts.]]></description>
            <content:encoded><![CDATA[<p>In the 9.1 release, we've made significant upgrades to alerting to help SREs and operators cut through the noise, understand what's happening faster, and take meaningful action with less guesswork.</p>
<p>Here's what's new:</p>
<h2>Improved Related Alert Grouping with Relevance Scoring &amp; Reasoning</h2>
<p>We've enhanced our related alert detection to go beyond surface-level correlations. Alerts are now grouped based on a relevance score that reflects the strength of their relationship across dimensions like:</p>
<ul>
<li><strong>Shared entities or resources</strong> (e.g. same host, pod, or service)</li>
<li><strong>Temporal proximity</strong> (alerts firing within a suspiciously short window)</li>
<li><strong>Signal similarity</strong> (e.g. spikes in logs, metrics, and traces that point to the same failure mode)</li>
</ul>
<p>More importantly, we now <strong>show the why</strong>. You'll see why an alert is grouped, whether it's sharing the same Kubernetes pod, has similar log patterns, or was triggered by the same upstream anomaly. This gives users confidence in the grouping logic and accelerates root cause analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-1.jpg" alt="Related Alerts" /></p>
<h2>Link Dashboards to Alert Rules and Get Smart Suggestions</h2>
<p>You can now <strong>link dashboards directly to your alert rules</strong>, giving responders an instant visual lens into the metrics or logs that matter most for that alert. No more scrambling to remember which dashboard to check — just click and go.</p>
<p>And we've made this smarter too: Elastic will now <strong>suggest relevant dashboards</strong> based on the alert's source, rule logic, or monitored entities, helping users land on the right view without needing to configure anything upfront.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-2.jpg" alt="Related Alerting Dashboards" /></p>
<h2>Investigation Guides Embedded Into Alerts</h2>
<p>Every alert can now be configured with an <strong>investigation guide</strong>, a set of pre-configured, context-aware instructions or next steps tailored to the alert. Think of it as a playbook that's embedded right where and when you need it.</p>
<p>Use it to:</p>
<ul>
<li>Document your team's runbooks and standard triage steps or link to existing runbooks</li>
<li>Guide junior engineers or on-call responders through unfamiliar territory</li>
<li>Automate the first few steps of root cause analysis</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/alerting-3.jpg" alt="Investigation Guide" /></p>
<h2>Why This Matters</h2>
<p>These changes are all about reducing mean time to detect (MTTD) and mean time to resolve (MTTR). By:</p>
<ul>
<li>Grouping alerts more intelligently (and transparently)</li>
<li>Giving you the dashboards you need, when you need them</li>
<li>Embedding action-oriented guides in every alert</li>
</ul>
<p>We're bringing you closer to a truly streamlined incident response workflow: no swivel-chairing, no guesswork, just clarity.</p>
<p>Additionally, look at some of our other articles on Elastic Observability Labs related to analysis:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/ai-assistant">Using the AI Assistant in Elastic Observability to Accelerate Root Cause Analysis</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/log-analytics">All of the log analytics features in Elastic Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry">Our latest on OpenTelemetry support in Elastic Observability</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-stack-observability-alerting-upgrade/cover-alerting.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How Streams Generates a Log Pipeline in Seconds]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-streams-ai-pipeline-generation</link>
            <guid isPermaLink="false">elastic-streams-ai-pipeline-generation</guid>
            <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Streams generates a complete, tested log processing pipeline from a single click. Here's the two-stage mechanism behind it: deterministic fingerprinting, a reasoning agent that iterates against real data, and hard validation thresholds that enforce quality before you see the result.]]></description>
            <content:encoded><![CDATA[<p>Just click the Suggest pipeline button in Kibana's Processing tab and within a few seconds you're looking at a complete pipeline (Grok pattern, date normalization, type conversions) with a preview of how your actual log documents parse through it.</p>
<p>The alternative is doing this by hand: writing a Grok pattern, testing it, fixing the edge cases, realizing the field names don't match ECS, renaming them, adding a date processor. And all that is just the work for a single service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/architecture-overview@2x.png" alt="Two-stage pipeline generation architecture" /></p>
<h2>The three jobs every log pipeline has</h2>
<p>Every log processing pipeline does the same three things: it extracts fields from raw log messages, normalizes them to a consistent schema, and cleans up whatever you don't need. Most teams build and maintain these pipelines by hand, which gets challenging once log formats change, the person who wrote the Grok pattern has moved teams, and nothing about the pipeline is documented except the pattern itself.</p>
<p>Every new service now means doing it again from scratch, with a different format, different edge cases, and eventually a different person maintaining a pattern they didn't write.</p>
<p>For the initial pipeline, Streams handles all three jobs automatically and validates the result before anything touches your production data.</p>
<h2>What happens when you click &quot;Suggest pipeline&quot;</h2>
<p>Open the Processing tab for a stream in Kibana. Click the button. Within seconds, the panel populates with a proposed pipeline (typically a parsing step, date normalization, type conversions, and field cleanup) along with a live preview showing what your most recent documents look like after the pipeline runs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/pipeline.gif" alt="Streams pipeline generation in Kibana" /></p>
<p>In this view, you can see the exact fields that will be extracted, their types, and how many of your sample documents parsed successfully. If a field name is off, you can edit it inline; if a step is adding noise, just remove it. And if the parse rate needs work, you can adjust the pipeline and re-run generation. Nothing is written to the stream until you explicitly confirm. For now, at least, this confirmation keeps a human in the loop for these changes. As systems like this mature, it may no longer be necessary.</p>
<p>Let's walk through the steps in more detail.</p>
<h2>Stage 1: Log grouping and pattern extraction</h2>
<p>The first stage of our process doesn't involve a reasoning model. It's actually deterministic: the same input always produces the same output, with no variance from a model. It also scopes down what Stage 2 has to figure out.</p>
<p>Before any extraction runs, Streams clusters the messages by log format fingerprint. The algorithm is simple: each run of digits maps to <code>0</code>, each run of letters maps to <code>a</code>, and punctuation is preserved as-is. Two messages that produce the same fingerprint land in the same group.</p>
<pre><code># two entries from the same nginx stream
2026-03-30 14:22:31 192.168.1.100 - james &quot;GET /api/v1/health&quot; 200
2026-03-30 08:01:05 10.0.0.5      - alice &quot;GET /api/v2/status&quot; 404

# fingerprint
0-0-0 0:0:0 0.0.0.0 - a     &quot;a /a/a0/a&quot; 0
0-0-0 0:0:0 0.0.0.0 - a     &quot;a /a/a0/a&quot; 0
</code></pre>
<p>A stream with mixed log formats produces multiple groups, one per distinct format in the batch. This is a fairly simple but really effective way for us to cluster similar logs together and it makes all the other steps much more reliable.</p>
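<p>The fingerprinting rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Streams implementation: the function names are hypothetical, only ASCII letters and digits are handled, and collapsing runs of whitespace to a single space is an assumption made so that column-aligned lines compare equal.</p>

```python
import re
from collections import defaultdict


def fingerprint(message: str) -> str:
    """Collapse a log line to its structural shape (hypothetical helper)."""
    shape = re.sub(r"[0-9]+", "0", message)   # each run of digits becomes 0
    shape = re.sub(r"[A-Za-z]+", "a", shape)  # each run of letters becomes a
    shape = re.sub(r"\s+", " ", shape)        # assumption: whitespace runs collapse
    return shape


def group_by_fingerprint(messages):
    """Bucket messages so that each distinct format lands in its own group."""
    groups = defaultdict(list)
    for message in messages:
        groups[fingerprint(message)].append(message)
    return dict(groups)


lines = [
    '2026-03-30 14:22:31 192.168.1.100 - james "GET /api/v1/health" 200',
    '2026-03-30 08:01:05 10.0.0.5      - alice "GET /api/v2/status" 404',
]
groups = group_by_fingerprint(lines)
```

<p>Both sample lines reduce to the same shape, so <code>groups</code> holds a single key with two messages.</p>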
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/stage1-parallel-extraction@2x.png" alt="Stage 1: log grouping and per-group pattern extraction" /></p>
<p>Both Grok and Dissect run on the same input, though they work differently. Grok runs per group, as it supports multiple patterns and handles each distinct format independently. Dissect uses a single pattern, so it targets only the largest group in the batch.</p>
<p>For each candidate, a heuristic algorithm analyzes the messages and identifies field boundaries: what's fixed text and what varies. It generates a pattern with positional placeholder names. An LLM then reviews the extracted field positions against a sample of up to 10 messages and renames the placeholders to human-readable, schema-compliant names.</p>
<pre><code># grok heuristic output (positional placeholders)
%{IPV4:field_0} - %{USER:field_1} \[%{HTTPDATE:field_2}\] &quot;%{WORD:field_3} %{URIPATHPARAM:field_4}...&quot;

# after LLM field naming (ECS-aligned)
%{IPV4:source.ip} - %{USER:user.name} \[%{HTTPDATE:@timestamp}\] &quot;%{WORD:http.request.method} %{URIPATHPARAM:url.path}...&quot;

# dissect heuristic output (positional placeholders)
%{field_0} - %{field_1} [%{field_2}] &quot;%{field_3} %{field_4} %{?field_5}&quot; %{field_6} %{field_7}

# after LLM field naming (ECS-aligned)
%{source.ip} - %{user.name} [%{@timestamp}] &quot;%{http.request.method} %{url.path} %{?http_version}&quot; %{http.response.status_code} %{http.response.body.bytes}
</code></pre>
<p>The resulting processor is simulated against your submitted documents to measure its parse rate. Grok is more expressive, with typed fields, named captures, and multiple sub-patterns; the big downside is that it's also slower. Dissect, on the other hand, is faster but limited to fixed-position splits. Simple log formats tend to parse cleanly with Dissect; complex ones need Grok.</p>
<p>The candidate with the higher parse rate becomes that group's parsing processor. This runs for every group in the batch. Stage 1 hands Stage 2 one parsing processor per group found.</p>
<p>For a batch of nginx access logs, the extraction produces two candidates for the one format group present:</p>
<pre><code># input (sampled from 300 submitted documents)
192.168.1.100 - james [30/Mar/2026:14:22:31 +0000] &quot;GET /api/v1/health HTTP/1.1&quot; 200 1234

# grok candidate → parse rate 94% (282/300)
%{IPV4:source.ip} - %{USER:user.name} \[%{HTTPDATE:@timestamp}\] &quot;%{WORD:http.request.method} %{URIPATHPARAM:url.path} HTTP/%{NUMBER:http.version}&quot; %{NUMBER:http.response.status_code:int} %{NUMBER:http.response.body.bytes:int}

# dissect candidate → parse rate 71% (213/300)
%{source.ip} - %{user.name} [%{@timestamp}] &quot;%{http.request.method} %{url.path} %{?http_version}&quot; %{http.response.status_code} %{http.response.body.bytes}

# winner: grok
</code></pre>
<p>Grok wins here because <code>%{HTTPDATE}</code> handles the bracketed timestamp format; Dissect tries to split on fixed positions and fails on the surrounding brackets. Both run in parallel; comparing their results adds negligible time since this initial simulation is only done on a sample of documents.</p>
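<p>The selection step amounts to comparing parse rates. The sketch below uses the counts from the nginx example; the <code>Candidate</code> class and <code>pick_winner</code> function are hypothetical names for illustration, and how Streams breaks ties is not covered here.</p>

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    kind: str    # "grok" or "dissect"
    parsed: int  # documents the pattern parsed successfully
    total: int   # documents in the simulated sample

    @property
    def parse_rate(self) -> float:
        return self.parsed / self.total


def pick_winner(candidates):
    """Return the candidate with the highest parse rate."""
    return max(candidates, key=lambda c: c.parse_rate)


candidates = [
    Candidate("grok", parsed=282, total=300),
    Candidate("dissect", parsed=213, total=300),
]
winner = pick_winner(candidates)
```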
<h2>Stage 2: The reasoning agent</h2>
<p>Stage 1 produces a parsing processor; Stage 2 turns it into a complete, validated pipeline.</p>
<p>This stage uses a reasoning agent that iterates through a loop with two tools, running up to six iterations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/stage2-agent-loop@2x.png" alt="Stage 2 reasoning agent loop with hard validation thresholds" /></p>
<p>The loop:</p>
<ol>
<li>The agent takes the Stage 1 parsing processor and proposes additional steps: date normalization, type conversions, field cleanup, and PII masking for fields it identifies as sensitive.</li>
<li>It runs the complete proposed pipeline against your original documents (the raw data, not pre-processed) and returns validation results.</li>
<li>If the simulation fails, the agent reads the error messages and adjusts. The failures are very specific, and we rely on the LLM's ability to understand them: which processor failed, on what percentage of documents, with what error type. When the parse rate drops below 80%, the tool returns:</li>
</ol>
<pre><code>Parse rate is too low: 67.00% (minimum required: 80%). The pipeline is not
extracting fields from enough documents. Review the processors and ensure
they handle the document structure correctly.

Processor &quot;grok[0]&quot; has a failure rate of 33.00% (maximum allowed: 20%).
This processor is failing on too many documents.
</code></pre>
<p>The agent now reads the processor name, the failure rate, and the threshold, then adjusts the pattern on the next iteration. It can't commit until the errors resolve.</p>
<ol start="4">
<li>This repeats until the pipeline passes. The agent then commits it and sends it for user approval in the UI.</li>
</ol>
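<p>The control flow of the loop above can be sketched as follows. This is only an outline under stated assumptions: <code>simulate</code> and <code>revise</code> are hypothetical callables standing in for the simulation tool and the agent's adjustment step, and a pipeline is represented as a plain list of steps.</p>

```python
MAX_ITERATIONS = 6  # the loop runs up to six iterations


def generate_pipeline(initial, simulate, revise):
    """Iterate until the simulation reports no errors or the budget runs out.

    simulate(pipeline) returns a list of error messages (empty means it passed).
    revise(pipeline, errors) returns an adjusted pipeline.
    """
    pipeline = initial
    for _ in range(MAX_ITERATIONS):
        errors = simulate(pipeline)
        if not errors:
            return pipeline  # passes validation: ready to commit for approval
        pipeline = revise(pipeline, errors)
    return None  # generation fails if nothing passes within the budget
```

<p>Returning <code>None</code> here is just one way to signal a failed generation; the point is that a pipeline is only ever returned after a clean simulation.</p>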
<p>To ensure quality we enforce two hard thresholds at the tool level, not by the agent's judgment:</p>
<ul>
<li>If fewer than 80% of documents parse successfully, the simulation returns an error. The agent must fix this before proceeding.</li>
<li>If any individual processor fails on more than 20% of documents, the simulation is invalid.</li>
</ul>
<p>Validation is also embedded in the tool: the model sees an error message and must resolve it before proceeding. It can't commit a pipeline that fails these checks.</p>
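<p>As a rough illustration of enforcing thresholds in the tool rather than in the model, the check below returns error messages whenever either limit is violated. The function and constant names are hypothetical, and the message wording is abbreviated from the sample output shown earlier.</p>

```python
MIN_PARSE_RATE = 80.0              # percent of documents that must parse
MAX_PROCESSOR_FAILURE_RATE = 20.0  # percent of documents one processor may fail on


def validate_simulation(parse_rate, processor_failure_rates):
    """Return a list of error messages; an empty list means the checks passed."""
    errors = []
    if parse_rate < MIN_PARSE_RATE:
        errors.append(
            f"Parse rate is too low: {parse_rate:.2f}% "
            f"(minimum required: {MIN_PARSE_RATE:.0f}%)."
        )
    for name, rate in processor_failure_rates.items():
        if rate > MAX_PROCESSOR_FAILURE_RATE:
            errors.append(
                f'Processor "{name}" has a failure rate of {rate:.2f}% '
                f"(maximum allowed: {MAX_PROCESSOR_FAILURE_RATE:.0f}%)."
            )
    return errors


errors = validate_simulation(67.0, {"grok[0]": 33.0})
```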
<p>Under the hood we're steering the agent in a specific direction. The system prompt here includes: &quot;Simplify first. Remove problematic processors rather than adding workarounds. A pipeline that handles 95% of documents perfectly is better than one that attempts 100% but fails unpredictably.&quot;</p>
<p>If your data is already well-structured (proper <code>@timestamp</code>, correct field types, no raw text that needs parsing), the agent detects this and commits an empty pipeline. It doesn't add processors for the sake of it.</p>
<h2>The output is Streamlang</h2>
<p>The agent writes Streamlang DSL, Elastic's processing language for streams, which compiles to ingest pipelines behind the scenes.</p>
<p>The field schema, the processor types, the step format: all expressed in Streamlang. Here's what the user-approved pipeline looks like for the nginx example above, targeting an ECS stream:</p>
<pre><code class="language-yaml">steps:
  - action: grok
    from: message
    patterns:
      - &quot;%{IPV4:source.ip} - %{USER:user.name} \\[%{HTTPDATE:@timestamp}\\] \&quot;%{WORD:http.request.method} %{URIPATHPARAM:url.path} HTTP/%{NUMBER:http.version}\&quot; %{NUMBER:http.response.status_code:int} %{NUMBER:http.response.body.bytes:int}&quot;
  - action: date
    from: &quot;@timestamp&quot;
    formats:
      - &quot;dd/MMM/yyyy:HH:mm:ss Z&quot;
  - action: convert
    from: http.response.status_code
    type: integer
  - action: remove
    from: message
</code></pre>
<h2>Two schemas, one generator</h2>
<p>Not everyone lands logs in the same shape, and Elastic needs to support a variety of formats. Teams running OpenTelemetry collectors want their data in OTel-native fields. Teams on Elastic's traditional stack expect ECS. Both are valid, and forcing everyone onto one schema would mean asking half our users to restructure their pipelines before they can even get started.</p>
<p>So Streams supports both, and the generator handles both. We automatically detect if we should use OTel or ECS here. For this we mostly look at the name of the stream and check if it contains <code>otel</code>, as that's what the current naming in our stack defaults to.</p>
<p>The pipeline looks different for each because the canonical field names differ:</p>
<table>
<thead>
<tr>
<th></th>
<th>OTel</th>
<th>ECS</th>
</tr>
</thead>
<tbody>
<tr>
<td>Log body</td>
<td><code>body.text</code></td>
<td><code>message</code></td>
</tr>
<tr>
<td>Log level</td>
<td><code>severity_text</code></td>
<td><code>log.level</code></td>
</tr>
<tr>
<td>Service name</td>
<td><code>resource.attributes.service.name</code></td>
<td><code>service.name</code></td>
</tr>
<tr>
<td>Host name</td>
<td><code>resource.attributes.host.name</code></td>
<td><code>host.name</code></td>
</tr>
</tbody>
</table>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/otel-vs-ecs@2x.png" alt="OTel vs ECS stream pipeline comparison with alias layer" /></p>
<p>An OTel stream gets a grok processor that reads from <code>body.text</code>:</p>
<pre><code class="language-json">{ &quot;action&quot;: &quot;grok&quot;, &quot;from&quot;: &quot;body.text&quot;, &quot;patterns&quot;: [&quot;...&quot;] }
</code></pre>
<p>An ECS stream reads from <code>message</code>:</p>
<pre><code class="language-json">{ &quot;action&quot;: &quot;grok&quot;, &quot;from&quot;: &quot;message&quot;, &quot;patterns&quot;: [&quot;...&quot;] }
</code></pre>
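<p>Put together, schema detection and the choice of source field might look like the sketch below. It is an illustration only: the function names and the example stream names are hypothetical, the case-insensitive match is an assumption, and the real detection considers more than the name.</p>

```python
# Canonical field names per schema, taken from the table above.
FIELD_NAMES = {
    "otel": {
        "log_body": "body.text",
        "log_level": "severity_text",
        "service_name": "resource.attributes.service.name",
        "host_name": "resource.attributes.host.name",
    },
    "ecs": {
        "log_body": "message",
        "log_level": "log.level",
        "service_name": "service.name",
        "host_name": "host.name",
    },
}


def detect_schema(stream_name: str) -> str:
    """Treat a stream as OTel when its name contains 'otel', otherwise ECS."""
    return "otel" if "otel" in stream_name.lower() else "ecs"


def grok_step(stream_name: str, patterns):
    """Build a grok step that reads from the schema's log body field."""
    schema = detect_schema(stream_name)
    return {"action": "grok", "from": FIELD_NAMES[schema]["log_body"], "patterns": patterns}


otel_step = grok_step("logs.otel", ["..."])  # hypothetical stream name
ecs_step = grok_step("logs.ecs", ["..."])    # hypothetical stream name
```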
<p>OTel streams alias the ECS field names to their OTel equivalents. <code>log.level</code> is an alias for <code>severity_text</code>. <code>message</code> is an alias for <code>body.text</code>. A query written for ECS works on an OTel stream without changes, since the alias layer handles the translation.</p>
<pre><code class="language-json">{
  &quot;message&quot;:    { &quot;path&quot;: &quot;body.text&quot;,     &quot;type&quot;: &quot;alias&quot; },
  &quot;log.level&quot;:  { &quot;path&quot;: &quot;severity_text&quot;, &quot;type&quot;: &quot;alias&quot; }
}
</code></pre>
<p>The agent is aware of which side of this it's on. It doesn't add a rename step for <code>severity_text</code> → <code>log.level</code> on an OTel stream because the alias already provides that mapping. On an ECS stream, it adds the normalization explicitly.</p>
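<p>Conceptually, the alias layer is a name lookup performed before the field is read. The sketch below mimics that behaviour on a flat dictionary; it is not how Elasticsearch resolves aliases internally, and <code>read_field</code> is a hypothetical helper.</p>

```python
# ECS name -> OTel field it aliases, mirroring the mapping shown above.
OTEL_ALIASES = {
    "message": "body.text",
    "log.level": "severity_text",
}


def read_field(doc: dict, field: str):
    """Resolve an ECS-style name through the alias map, then read the value."""
    concrete = OTEL_ALIASES.get(field, field)
    return doc.get(concrete)


otel_doc = {"body.text": "upstream timed out", "severity_text": "error"}
level = read_field(otel_doc, "log.level")
```

<p>A lookup for <code>log.level</code> lands on <code>severity_text</code>, which is why no explicit rename step is needed on the OTel side.</p>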
<h2>Schema normalization</h2>
<p>Field extraction is the most important and obvious part, but our fields also need to align.</p>
<p>If two services both log HTTP requests but call the status code field differently (<code>response_status</code> in one, <code>http_code</code> in another), a query for <code>http.response.status_code: 5*</code> returns nothing for either of them. Schema normalization maps both to the standard name:</p>
<pre><code># before: extracted field names from two different services
{ &quot;response_status&quot;: 500 }    # service A
{ &quot;http_code&quot;: 500 }           # service B

# after: ECS normalization
{ &quot;http.response.status_code&quot;: 500 }
</code></pre>
<p>Now every service uses <code>http.response.status_code</code>, and the query works across all of them.</p>
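<p>In its simplest form, this kind of normalization is a rename table applied to each document. The sketch below covers only the two field names from the example; the <code>RENAMES</code> table and <code>normalize</code> function are hypothetical, and the real agent derives its mappings from ECS and OTel metadata rather than a hard-coded dictionary.</p>

```python
# Service-specific field names mapped to the standard ECS name.
RENAMES = {
    "response_status": "http.response.status_code",
    "http_code": "http.response.status_code",
}


def normalize(doc: dict) -> dict:
    """Rename known non-standard fields; leave everything else untouched."""
    return {RENAMES.get(key, key): value for key, value in doc.items()}


service_a = normalize({"response_status": 500})
service_b = normalize({"http_code": 500})
```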
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/schema-normalization@2x.png" alt="Schema normalization: two services with different field names mapped to a single ECS field" /></p>
<p>During simulation, the agent checks ECS and OTel metadata for every field it generates. Fields that already have standard names are left alone. Fields that map to a known ECS field get renamed. The simulation metrics surface this explicitly: each field in the results includes its ECS or OTel type indicator, so you can see at a glance what's been normalized.</p>
<h2>The bar the agent must clear</h2>
<p>The system prompt sets explicit acceptance criteria for a user-approved pipeline:</p>
<ul>
<li>99% of documents must have a valid <code>@timestamp</code></li>
<li>All fields must have the correct types for the target schema</li>
<li>The overall failure rate must be below 0.5%</li>
</ul>
<p>If the agent can't satisfy all of these within six iterations, the generation fails.</p>
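<p>Expressed as a check over simulation counts, the three criteria could look like this. The function and its parameters are hypothetical; in Streams these criteria live in the system prompt, so this only illustrates the arithmetic.</p>

```python
def meets_acceptance_criteria(total, valid_timestamp, failed, type_mismatches):
    """Check the three acceptance criteria using integer arithmetic.

    total: number of simulated documents
    valid_timestamp: documents with a valid @timestamp
    failed: documents the pipeline failed on
    type_mismatches: fields whose type does not match the target schema
    """
    timestamps_ok = valid_timestamp * 100 >= 99 * total  # at least 99%
    types_ok = type_mismatches == 0                      # all fields typed correctly
    failures_ok = failed * 1000 < 5 * total              # below 0.5%
    return timestamps_ok and types_ok and failures_ok


passes = meets_acceptance_criteria(total=1000, valid_timestamp=995, failed=4, type_mismatches=0)
```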
<h2>To summarize</h2>
<p>Pipeline generation takes seconds where the manual process takes hours. The time savings come from automating the validation loop you'd otherwise run by hand: write a pattern, test it against real documents, read the failures, adjust, and try again. The agent does this in up to six cycles against the last documents your stream actually received.</p>
<h2>What's coming next in Streams and processing</h2>
<p>The most user-facing change in progress is the refinement loop. Right now, if the suggestion is close but not exactly right, you edit steps manually and that's it. The next version lets you adjust the proposed pipeline and send it back through the agent with your changes as context, so it builds from where you left off rather than starting from scratch.</p>
<p>Two other things are in progress: generation going async (currently it blocks the UI for a few seconds; soon it runs in the background), and support for streams that already have a pipeline. For now, it only handles streams without existing processing steps.</p>
<p>The same capabilities are also being exposed as callable tools in the Streams agent builder and as APIs for third-party agent frameworks. An agent can run a full pipeline generation as part of a broader onboarding workflow, without the UI.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-streams-ai-pipeline-generation/og-image@2x.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Fixing Elastic Streams processing failures without dropping data]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-streams-failure-store-processing</link>
            <guid isPermaLink="false">elastic-streams-failure-store-processing</guid>
            <pubDate>Thu, 30 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[When your Streams ingest pipeline breaks, failed documents land in the failure store, not the floor. Here's how to use those exact failures to fix your pipeline without re-ingesting from the source.]]></description>
            <content:encoded><![CDATA[<p>If you've run a Streams pipeline for more than a week, you've probably hit a processing failure. Before Streams, that often meant dropped data or a dead letter queue at the shipper layer: extra infrastructure you had to operate separately. Here's the recovery loop today.</p>
<h2>When processing fails, data lands in the failure store</h2>
<p>When a Streams pipeline fails (a Grok pattern doesn't match, a field type conflicts with the mapping), the documents that caused the failure are written to the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store">failure store</a>. The failure store is a set of backing indices attached to your data stream. It scales the same way as any other data stream, so it can absorb everything that fails. It's enabled by default for logs as of Elasticsearch 9.2.</p>
<p>The <strong>Data quality</strong> tab gives you insights into the quality of your stream and into documents in the failure store. When failures are accumulating, you'll see a rising count of failed documents along with the error type and a sample of the messages that triggered it.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-failure-store-processing/processing-failures-in-failure-store.png" alt="Processing failures accumulating in the failure store" /></p>
<p><em>The Data quality tab showing a rising failure count, error type, and a sample of the documents that triggered it.</em></p>
<p>A Grok expression mismatch (<code>illegal_argument_exception</code>) is sending documents to the failure store. The raw log line doesn't match the expected pattern. The documents aren't dropped. They're in the failure store, ready to debug against.</p>
<h2>Processing: Switch the sample source to the failure store</h2>
<p>Start by navigating to the <strong>Processing</strong> tab.</p>
<p>By default, the editor samples from recent live documents. Switch the sample source to <strong>Failure store</strong> instead: it loads the exact documents that failed, the unmodified originals before any Streams processing ran. You're iterating against the actual failures.</p>
<p>Change the sample source dropdown from the default to Failure store.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-failure-store-processing/sample-source-dropdown.png" alt="Sample source dropdown showing Latest samples and Failure store options" /></p>
<p><em>The sample source dropdown with the Failure store option selected.</em></p>
<p>The editor loads up to 100 documents from the failure store and runs them through the current pipeline. You can see exactly where parsing breaks down.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-failure-store-processing/failure-store-samples-processing.png" alt="Pipeline editor with failure store selected as the sample source" /></p>
<p><em>The pipeline editor loaded with documents from the failure store instead of recent live samples.</em></p>
<h2>Fix the processor against the actual failures</h2>
<p>With the failure store documents loaded as samples, iterate on the processor. The editor shows you the result against the actual failed documents in real time.</p>
<p>In this example, the pipeline was originally built to parse HTTP access logs:</p>
<pre><code>DELETE /api/v1/auth/logout from 26.72.241.177 - Status: 200 - Response time: 38ms - Request ID: req_24363339 - Location: São Paulo, BR - Device: desktop
HEAD /api/v1/notifications from 20.94.145.254 - Status: 202 - Response time: 60ms - Request ID: req_74513322 - Location: Tokyo, JP - Device: mobile
</code></pre>
<p>The original Grok pattern matched those:</p>
<pre><code>%{WORD:http.method} %{URIPATH:uri.path}
</code></pre>
<p>A second log type started flowing in. Cache hits and external API calls arrived in a different format:</p>
<pre><code>cache_hit: Cache hit for key: config
external_api_call: External API call completed - latency: 1695ms - Duration: 598ms
</code></pre>
<p>The original pattern doesn't match these at all. Every one goes straight to the failure store. With the failure store loaded as the sample source, the problem is immediately obvious: the editor shows the parse failing on lines that start with a word followed by a colon, not an HTTP method followed by a path.</p>
<p>The fix is a second pattern to handle the new format:</p>
<pre><code>%{WORD:event.type}: %{GREEDYDATA:message}
</code></pre>
<p>Add it to the processor, and the editor immediately shows both log types parsing correctly against the failure store samples.</p>
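<p>To see why a second pattern resolves the failures, here is a rough Python stand-in that tries two patterns in order. The regular expressions only approximate the Grok patterns above (Grok's <code>WORD</code> and <code>URIPATH</code> are more general), and the group names use underscores because Python group names cannot contain dots.</p>

```python
import re

# Simplified stand-ins for the two Grok patterns, tried in order.
PATTERNS = [
    re.compile(r"(?P<http_method>[A-Z]+) (?P<uri_path>/\S*)"),  # HTTP access logs
    re.compile(r"(?P<event_type>\w+): (?P<message>.*)"),        # event-style logs
]


def parse(line: str):
    """Return the fields from the first matching pattern, or None on failure."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return match.groupdict()
    return None  # in Streams, a document like this would go to the failure store


http_fields = parse("DELETE /api/v1/auth/logout from 26.72.241.177 - Status: 200")
cache_fields = parse("cache_hit: Cache hit for key: config")
```

<p>With only the first pattern in the list, the cache line would return <code>None</code>; adding the second pattern lets both formats parse.</p>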
<p>When the sample view shows all fields extracting correctly and the parse rate hits 100%, the fix is ready.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-failure-store-processing/successful-parsing.png" alt="Pipeline editor showing successful parsing against failure store samples" /></p>
<p><em>Both log types parsing correctly after adding the second Grok pattern. Parse rate at 100%.</em></p>
<p>No guessing — the editor confirms the fix before you save.</p>
<h2>Watch the failure count drop</h2>
<p>Save the updated pipeline. New documents are now processed with the corrected pipeline. Switch back to the Data quality tab and watch the failure count.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-failure-store-processing/resolved.png" alt="Failure store count dropping after pipeline fix" /></p>
<p><em>The failure count dropping as new documents are processed by the corrected pipeline.</em></p>
<p>The count drops as the fixed pipeline handles new incoming data correctly. The remaining documents in the failure store are the pre-fix failures. They'll clear out as retention ages them off.</p>
<p>The fix applies to new documents only. Documents already in the failure store aren't automatically reprocessed; each was processed by the pipeline version active when it arrived. If you need them in your main stream, that's a separate step.</p>
<h2>The recovery loop</h2>
<p>Open Data quality, switch to the failure store, fix the processor, save. The whole thing takes a few minutes at most.</p>
<p>No re-ingestion from source. No shipper-level dead letter queue to operate. If you haven't checked the Data quality tab for your streams recently, it's worth a look. There might be failures sitting there that a one-line fix would clear.</p>
<p>For a deeper look at what the Data quality tab shows and how to configure the failure store, see <a href="https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams">Elastic Observability: Streams Data Quality and Failure Store Insights</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-streams-failure-store-processing/processing-failures-in-failure-store.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Streams Processing: Stop Fighting with Grok. Parse Your Logs in Streams.]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-streams-processing</link>
            <guid isPermaLink="false">elastic-streams-processing</guid>
            <pubDate>Thu, 11 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Streams Processing works under the hood and how to use it to build, test, and deploy parsing logic on live data quickly.]]></description>
            <content:encoded><![CDATA[<p>With Streams, Elastic's new AI capability in 9.2, we make parsing your logs so simple that it's no longer a concern. Logs are generally messy: lots of fields, some understood, some unknown. You have to constantly keep up with their semantics and pattern-match to parse them properly. In some cases, even the fields you know carry different values or semantics than you expect. For instance, <code>timestamp</code> is the ingest time, not the event time. Or you can't even filter by <code>log.level</code> or <code>user.id</code> because they're buried inside the <code>message</code> field. As a result, your dashboards are flat and not useful.</p>
<p>Fixing this used to mean leaving Kibana, learning Grok syntax, manually editing ingest pipeline JSON or a complicated Logstash config, and hoping you didn't break parsing for everything else.</p>
<p>We built Streams to fix this, and much more. It's your one place for data processing, built right into Kibana, that lets you build, test, and deploy parsing logic on live data in seconds. It turns a high-risk backend task into a fast, predictable, interactive UI workflow. You can use AI to automatically generate Grok rules from a sample of logs, or build them easily with the UI. Let's walk through an example.</p>
<h2>A Quick Walkthrough</h2>
<p>Let's fix a common &quot;unstructured&quot; log right now.</p>
<ol>
<li><strong>Start in Discover</strong>. You find a log that isn't structured. The <code>@timestamp</code> is wrong, and fields like <code>log.level</code> aren't being extracted, so your histograms are just a single-color bar.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/start-in-discover.png" alt="start in discover" /></p>
<ol start="2">
<li><strong>Inspect the log</strong>. Open the document flyout (the &quot;Inspect a single log event&quot; view). You'll see a button: <strong>&quot;Parse content in Streams&quot;</strong> (or &quot;Edit processing in Streams&quot;). Click it.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/inspect-the-log.png" alt="inspect the log" /></p>
<ol start="3">
<li><strong>Go to Processing</strong>. This takes you directly to the Streams processing tab, pre-loaded with sample documents from that data stream. Click <strong>&quot;Create your first step.&quot;</strong></li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/go-to-processing.png" alt="go to streams processing" /></p>
<ol start="4">
<li><strong>Generate a Pattern</strong>. The processor defaults to Grok. You don't have to write the pattern yourself. Just click the <strong>&quot;Generate Pattern&quot;</strong> button. Streams analyzes 100 sample documents from your stream and suggests a Grok pattern for you. By default, this uses the Elastic Managed LLM, but you can configure your own.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/generate-pattern.png" alt="generate the pattern" /></p>
<ol start="5">
<li><strong>Accept and Simulate</strong>. Click &quot;Accept.&quot; Instantly, the UI runs a simulation across all 100 sample documents. You can make changes to the pattern or adjust field names, and the simulation re-runs with every keystroke.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/accept-and-simulate.png" alt="simulate and accept" /></p>
<p>When you're happy, you save it. Your new logs will now be parsed correctly.</p>
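<p>For comparison, the kind of logic built in this walkthrough can be sketched as a plain ingest pipeline and tried out with the simulate API in Dev Tools. This is a minimal sketch, not what Streams actually generates: the sample log line, the <code>event_time</code> field name, and the Grok pattern are all made up for illustration.</p>
<pre><code class="language-json"># Hypothetical log line, field names, and pattern (illustration only)
POST _ingest/pipeline/_simulate
{
  &quot;pipeline&quot;: {
    &quot;processors&quot;: [
      {
        &quot;grok&quot;: {
          &quot;field&quot;: &quot;message&quot;,
          &quot;patterns&quot;: [
            &quot;%{TIMESTAMP_ISO8601:event_time} %{LOGLEVEL:log.level} user=%{NOTSPACE:user.id} %{GREEDYDATA:message}&quot;
          ]
        }
      },
      {
        &quot;date&quot;: {
          &quot;field&quot;: &quot;event_time&quot;,
          &quot;formats&quot;: [&quot;ISO8601&quot;]
        }
      }
    ]
  },
  &quot;docs&quot;: [
    {
      &quot;_source&quot;: {
        &quot;message&quot;: &quot;2025-10-21T14:03:11.512Z ERROR user=4711 Payment gateway timeout&quot;
      }
    }
  ]
}
</code></pre>
<p>The response returns each sample document as it would look after processing. The <code>date</code> processor writes to <code>@timestamp</code> by default, which is how the event time replaces the ingest time.</p>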
<h2>Powerful Features for Messy, Real-World Logs</h2>
<p>That's the simple case. But real-world data is rarely that clean. Here are the features built to handle the complexity.</p>
<h3>The Interactive Grok UI</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/the-interactive-grok-ui.png" alt="interactive grok" /></p>
<p>When you use the Grok processor, the UI gives you a <strong>visual indication</strong> of what your pattern is extracting. You can see which parts of the <code>message</code> field are being mapped to which new field names. This immediate feedback means you're not just guessing. The editor also offers autocompletion of Grok patterns and instant pattern validation.</p>
<h3>The Diff Viewer</h3>
<p>How do you know what exactly changed? Expand any row in the simulation table. You'll get a diff view showing precisely which fields were added, removed, or modified for that specific document. No more guesswork.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/the-diff-viewer.png" alt="the diff viewer" /></p>
<h3>End to End Simulation and Detecting Failures</h3>
<p>This is the most critical part. Streams doesn't just simulate the processor; it simulates the entire indexing process.
If you try to map a non-timestamp string (like the <code>message</code> field) directly to the <code>@timestamp</code> field, the simulation will show a failure. It detects the mapping conflict before you save it and before it can create a data-mapping conflict in your cluster. This safety net is what lets you move fast.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/end-to-end-simulation.png" alt="end to end simulation" /></p>
<h3>Conditional Processing</h3>
<p>What if one data stream contains a large variety of logs? You can't use one Grok pattern for all.</p>
<p>Streams has conditional processing built for this. The UI lets you build &quot;if-then&quot; logic and shows you exactly what percentage of your sample documents are skipped or processed by your conditions. Right now, the UI supports up to 3 levels of nesting, and we plan to add a YAML mode in the future for more complex logic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/conditional-processing.png" alt="conditional processing" /></p>
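<p>In ingest pipeline terms, a condition corresponds roughly to an <code>if</code> clause on a processor. The sketch below is an assumption about the general shape, not the exact pipeline Streams writes; the service names, log lines, and pattern are hypothetical.</p>
<pre><code class="language-json"># Hypothetical condition and sample documents (illustration only)
POST _ingest/pipeline/_simulate
{
  &quot;pipeline&quot;: {
    &quot;processors&quot;: [
      {
        &quot;grok&quot;: {
          &quot;if&quot;: &quot;ctx.service?.name == 'data_processing'&quot;,
          &quot;field&quot;: &quot;message&quot;,
          &quot;patterns&quot;: [&quot;%{LOGLEVEL:log.level} %{GREEDYDATA:message}&quot;]
        }
      }
    ]
  },
  &quot;docs&quot;: [
    { &quot;_source&quot;: { &quot;service&quot;: { &quot;name&quot;: &quot;data_processing&quot; }, &quot;message&quot;: &quot;WARN queue depth high&quot; } },
    { &quot;_source&quot;: { &quot;service&quot;: { &quot;name&quot;: &quot;frontend&quot; }, &quot;message&quot;: &quot;GET /index.html 200&quot; } }
  ]
}
</code></pre>
<p>Only the first document satisfies the condition, so the second passes through the Grok processor untouched. That is the processed-versus-skipped split the UI reports as a percentage.</p>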
<h3>Changing Your Test Data (Document Samples)</h3>
<p>A random 100-document sample isn't always helpful, especially in a massive, mixed stream from Kubernetes or a central message broker.</p>
<p>You can change the document sample to test your changes on a more specific set of logs. You can either provide documents manually (copy-paste) or, more powerfully, specify a KQL query such as <code>service.name : &quot;data_processing&quot;</code> to fetch 100 matching sample documents for the simulation. Now you can build and test a processor on the exact logs you care about.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/changing-your-test-data.png" alt="changing your test data" /></p>
<h2>How Processing Works Under the Hood</h2>
<p>There’s no magic. In simple terms, Streams is a UI that makes our existing best practices more accessible. As of version 9.2, Streams runs exclusively on <strong>Elasticsearch ingest pipelines</strong>. (We plan to offer more than that, so stay tuned.)</p>
<p>When you save your changes, Streams appends processing steps by:</p>
<ol>
<li>Locating the most specific <code>@custom</code> ingest pipeline for your data stream.</li>
<li>Adding a single <code>pipeline</code> processor to it.</li>
<li>This processor calls a new, dedicated pipeline named <code>&lt;stream-name&gt;@stream.processing</code>, which contains the Grok, conditional, and other logic you built in the UI.</li>
</ol>
<p>You can even see this for yourself by going to the <strong>Advanced tab</strong> in your Stream and clicking the pipeline name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/how-processing-works.png" alt="how processing works" /></p>
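<p>If you prefer Dev Tools, the same pipelines can be inspected with the ingest pipeline API. The stream name <code>logs-myapp-default</code> below is a hypothetical placeholder; substitute the name of your own stream.</p>
<pre><code class="language-json"># List every pipeline whose name ends in @stream.processing
GET _ingest/pipeline/*@stream.processing

# Inspect the processing pipeline of one stream (hypothetical stream name)
GET _ingest/pipeline/logs-myapp-default@stream.processing
</code></pre>
<p>The response lists the processors in order, which is a quick way to confirm that what you built in the UI is what actually runs at ingest time.</p>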
<h2>Processing in OTel, Elastic Agent, Logstash, or Streams? What to Use?</h2>
<p>This is a fair question. You have lots of ways to parse data.</p>
<ul>
<li><strong>Best: Structured logging at the Source</strong>. If you control the app writing the logs, make it log JSON in the format of your choice. This will always be the best way to do logging, but it's not always possible.</li>
<li><strong>Good, but not all the time: Elastic Agent + Integrations</strong>. If there is an existing integration for collecting and parsing your data, Streams won't do it any better. Use it!</li>
<li><strong>Good for tech-savvy users: OTel at the Edge</strong>. Use OTel (with OTTL) to set yourself up for the future.</li>
<li><strong>The easy Catch-All: In Streams</strong>. Streams can add a lot of value, especially when you use an Integration that primarily just ships the data into Elastic. The Kubernetes Logs integration is a good example: an Integration is used, but most logs aren't parsed automatically because they may come from a wide variety of pods.</li>
</ul>
<p>Think of Streams as your universal &quot;catch-all&quot; for everything that arrives unstructured. It's perfect for data from sources you don't control, for legacy systems, or for when you just need to fix a parsing error right now without a full application redeploy.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/processing-in-otel.png" alt="processing in otel" /></p>
<p>A quick note on schemas: Streams can handle both ECS (Elastic Common Schema) and OTel (OpenTelemetry) data. By default, it assumes your target schema is ECS. However, Streams will automatically detect and adapt to the OTel schema if your Stream's name contains the word “otel”, or if you're using the special Logs Stream (currently in tech preview). You get the same visual parsing workflow regardless of the schema.</p>
<p>All processing changes can also be made using a Kibana API. Note that the API is still in tech preview while we mature some of the functionality.</p>
<h2>Summary</h2>
<p>Parsing logs shouldn't be a tedious, high-stakes, backend-only task. Streams moves the entire workflow from a complex, error-prone approach to an interactive UI right where you already are. You can now build, test, and deploy parsing logic with instant, safe feedback. This means you can stop fighting your logs and finally start using them. The next time you see a messy log, don't ignore it. Click &quot;Parse content in Streams&quot; and fix it in 60 seconds.</p>
<p>Check out more log analytics articles in <a href="https://www.elastic.co/observability-labs/blog/tag/log-analytics">Elastic Observability Labs</a>.</p>
<p>Try out Elastic. Sign up for a trial at <a href="https://cloud.elastic.co/registration?fromURI=%2Fhome">Elastic Cloud</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-streams-processing/cover.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to cut Elasticsearch log storage costs with LogsDB]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elasticsearch-logsdb-index-mode-storage-savings</link>
            <guid isPermaLink="false">elasticsearch-logsdb-index-mode-storage-savings</guid>
            <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to enable LogsDB index mode in Elasticsearch and measure real storage savings. We compare a standard index against a LogsDB index using Apache logs and show how much storage you can reclaim.]]></description>
            <content:encoded><![CDATA[<p>LogsDB is a specialized Elasticsearch index mode that gives you full functionality at a fraction of the storage cost. Your Kibana dashboards, searches, alerts, and visualizations all continue to work exactly as before. No data is discarded. No queries need to be updated. No workflows break. It is one setting, and everything else gets cheaper.</p>
<p>In benchmarks, LogsDB brought a dataset from <strong>162.7 GB down to 39.4 GB</strong> — a <strong>76% reduction in storage</strong>. You can explore the full nightly benchmark results at <a href="https://elasticsearch-benchmarks.elastic.co/#tracks/logsdb/nightly/default/90d">elasticsearch-benchmarks.elastic.co</a>.</p>
<p>In this tutorial you'll reproduce the experiment yourself using Kibana Dev Tools and an Apache logs dataset. You'll create two indices that differ only in index mode, ingest the same documents into both, and measure the storage difference with the <code>_stats</code> API. By the end, you'll have measured the reduction on your own test data (44% in our run) and understand exactly why production numbers push even higher.</p>
<blockquote>
<p><strong>Already on Elasticsearch 9.2+?</strong> Any data stream with a <code>logs-</code> prefix already uses LogsDB by default. Jump to <a href="#what-about-your-existing-logs">What about your existing logs?</a> to verify your setup.</p>
</blockquote>
<blockquote>
<p><strong>Want the full picture?</strong> For the engineering history behind these savings — how Lucene doc values, synthetic <code>_source</code>, index sorting, and ZSTD were developed and stacked over twelve years — see <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-logsdb-storage-evolution"><em>Elasticsearch over the years: how LogsDB cuts index size by up to 75%</em></a>.</p>
</blockquote>
<h2>Prerequisites</h2>
<ul>
<li>Elasticsearch 8.17+ cluster, Elastic Cloud deployment, or Serverless</li>
<li>Kibana with Dev Tools access</li>
<li>Some logs to ingest (this tutorial uses Apache access logs)</li>
<li>Basic familiarity with running API calls in Kibana Dev Tools</li>
</ul>
<h2>How LogsDB saves storage</h2>
<p>LogsDB stacks three mechanisms to achieve its storage reduction:</p>
<ul>
<li><strong>Index sorting</strong> — documents are sorted by <code>host.name</code> then <code>@timestamp</code>, grouping similar log lines so compression codecs find far more repeated patterns. Sorting alone accounts for roughly 30% of the savings.</li>
<li><strong>ZSTD compression with delta/GCD/run-length encoding</strong> — <code>best_compression</code> switches from LZ4 to Zstandard and applies numeric codecs to each doc values column. The standard index in this tutorial uses LZ4, so part of what you're measuring is the full package LogsDB delivers automatically.</li>
<li><strong>Synthetic <code>_source</code></strong> — Elasticsearch skips storing the raw JSON blob entirely and reconstructs <code>_source</code> on demand from doc values, adding another 20–40% of savings on top.</li>
</ul>
<blockquote>
<p><strong>Synthetic <code>_source</code> trade-offs:</strong> Field ordering in returned documents may differ from the original, and some edge cases around multi-value array fields behave differently. For most log analytics workloads these differences are invisible, but check the <a href="#next-steps">synthetic <code>_source</code> documentation</a> before enabling it in latency-sensitive applications.</p>
</blockquote>
<p>For a deep dive into the architecture behind each mechanism, see <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-logsdb-storage-evolution"><em>Elasticsearch over the years: how LogsDB cuts index size by up to 75%</em></a>.</p>
<p>Let's now walk through the steps you can take to enable LogsDB and measure the storage savings.</p>
<h2>Step 1: Collect logs with Elastic Agent</h2>
<p>The recommended way to ingest Apache logs into Elasticsearch is through Elastic Agent with the Apache integration. It handles collection, parsing, ECS field mapping, and routing automatically.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elasticsearch-logsdb-index-mode-storage-savings/integration.png" alt="Elastic Agent Apache integration setup in Kibana" /></p>
<p>Browse all available integrations in the <a href="https://www.elastic.co/integrations">Elastic integrations catalog</a>.</p>
<p>Once the Agent is collecting logs and routing them to <code>logs-apache.access-*</code>, move to the next step.</p>
<h2>Step 2: Create the two indices</h2>
<p>All commands in this tutorial are run in <strong>Kibana Dev Tools</strong>.</p>
<p>Create one standard index and one LogsDB index with identical field mappings. The only difference is <code>&quot;index.mode&quot;: &quot;logsdb&quot;</code>.</p>
<p><strong>Standard index:</strong></p>
<pre><code class="language-json">PUT /apache-standard
{
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;@timestamp&quot;:                  { &quot;type&quot;: &quot;date&quot; },
      &quot;host.name&quot;:                   { &quot;type&quot;: &quot;keyword&quot; },
      &quot;http.request.method&quot;:         { &quot;type&quot;: &quot;keyword&quot; },
      &quot;url.path&quot;:                    { &quot;type&quot;: &quot;keyword&quot; },
      &quot;http.version&quot;:                { &quot;type&quot;: &quot;keyword&quot; },
      &quot;http.response.status_code&quot;:   { &quot;type&quot;: &quot;integer&quot; },
      &quot;http.response.bytes&quot;:         { &quot;type&quot;: &quot;integer&quot; },
      &quot;http.request.referrer&quot;:       { &quot;type&quot;: &quot;keyword&quot; },
      &quot;user_agent.original&quot;:         { &quot;type&quot;: &quot;keyword&quot; }
    }
  }
}
</code></pre>
<p><strong>LogsDB index:</strong></p>
<pre><code class="language-json">PUT /apache-logsdb
{
  &quot;settings&quot;: {
    &quot;index.mode&quot;: &quot;logsdb&quot;
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;@timestamp&quot;:                  { &quot;type&quot;: &quot;date&quot; },
      &quot;host.name&quot;:                   { &quot;type&quot;: &quot;keyword&quot; },
      &quot;http.request.method&quot;:         { &quot;type&quot;: &quot;keyword&quot; },
      &quot;url.path&quot;:                    { &quot;type&quot;: &quot;keyword&quot; },
      &quot;http.version&quot;:                { &quot;type&quot;: &quot;keyword&quot; },
      &quot;http.response.status_code&quot;:   { &quot;type&quot;: &quot;integer&quot; },
      &quot;http.response.bytes&quot;:         { &quot;type&quot;: &quot;integer&quot; },
      &quot;http.request.referrer&quot;:       { &quot;type&quot;: &quot;keyword&quot; },
      &quot;user_agent.original&quot;:         { &quot;type&quot;: &quot;keyword&quot; }
    }
  }
}
</code></pre>
<p>That single <code>&quot;index.mode&quot;: &quot;logsdb&quot;</code> line activates all three storage mechanisms. Elasticsearch enables these additional settings behind the scenes — you don't set any of them manually:</p>
<pre><code class="language-json">{
  &quot;index.sort.field&quot;:              [&quot;host.name&quot;, &quot;@timestamp&quot;],
  &quot;index.sort.order&quot;:              [&quot;asc&quot;, &quot;desc&quot;],
  &quot;index.codec&quot;:                   &quot;best_compression&quot;,
  &quot;index.mapping.ignore_malformed&quot;: true,
  &quot;index.mapping.ignore_above&quot;:    8191
}
</code></pre>
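<p>To check which settings are actually in effect on your cluster, you can ask for the index settings together with their defaults. This is a small sketch using the <code>apache-logsdb</code> index from this tutorial; where each value appears (under <code>settings</code> or <code>defaults</code>) may vary by version.</p>
<pre><code class="language-json"># include_defaults also returns settings you did not set explicitly
GET /apache-logsdb/_settings?include_defaults=true&amp;flat_settings=true
</code></pre>
<p>Look for <code>index.mode</code>, <code>index.sort.field</code>, <code>index.sort.order</code>, and <code>index.codec</code> in the response.</p>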
<h2>Step 3: Reindex the logs</h2>
<p>Use the <code>_reindex</code> API to copy the same documents into both test indices:</p>
<pre><code class="language-json">POST /_reindex
{
  &quot;source&quot;: { &quot;index&quot;: &quot;logs-apache.access-*&quot; },
  &quot;dest&quot;:   { &quot;index&quot;: &quot;apache-standard&quot; }
}

POST /_reindex
{
  &quot;source&quot;: { &quot;index&quot;: &quot;logs-apache.access-*&quot; },
  &quot;dest&quot;:   { &quot;index&quot;: &quot;apache-logsdb&quot; }
}
</code></pre>
<p>Both indices now hold identical documents, so the storage comparison in the next step reflects only the index mode difference.</p>
<h2>Step 4: Force merge for a fair comparison</h2>
<p>Before measuring, force merge both indices to a single segment:</p>
<pre><code>POST /apache-standard/_forcemerge?max_num_segments=1

POST /apache-logsdb/_forcemerge?max_num_segments=1
</code></pre>
<p>These calls block until the merge finishes. Wait for both responses before continuing.</p>
<p><strong>Why this matters:</strong> Elasticsearch writes data into multiple Lucene segments before merging them in the background. Measuring mid-merge gives artificially inflated numbers because each segment is compressed independently. Forcing a single segment shows the real steady-state storage footprint you'd see in a mature production index.</p>
<blockquote>
<p><strong>Only run <code>_forcemerge</code> on indices that are no longer being written to.</strong> Force merging an index that is still receiving writes is resource-intensive and can impact ingestion performance. In production, you can use <a href="https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management">Index Lifecycle Management (ILM)</a> to automate force merges as part of the warm phase (or the hot phase, after rollover), once an index is rolled over and no longer actively ingested into.</p>
</blockquote>
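<p>As a rough illustration of that ILM approach, the policy below rolls over in the hot phase and force merges to one segment in the warm phase. The policy name and all thresholds are hypothetical placeholders, not recommendations; adjust them to your own retention and sizing needs.</p>
<pre><code class="language-json"># Hypothetical policy name and thresholds (illustration only)
PUT _ilm/policy/logs-forcemerge-example
{
  &quot;policy&quot;: {
    &quot;phases&quot;: {
      &quot;hot&quot;: {
        &quot;actions&quot;: {
          &quot;rollover&quot;: {
            &quot;max_primary_shard_size&quot;: &quot;50gb&quot;,
            &quot;max_age&quot;: &quot;7d&quot;
          }
        }
      },
      &quot;warm&quot;: {
        &quot;min_age&quot;: &quot;1d&quot;,
        &quot;actions&quot;: {
          &quot;forcemerge&quot;: {
            &quot;max_num_segments&quot;: 1
          }
        }
      }
    }
  }
}
</code></pre>
<p>Because the force merge only runs after rollover, it never touches the index that is currently receiving writes.</p>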
<h2>Step 5: Measure the difference</h2>
<pre><code>GET /apache-standard/_stats?filter_path=indices.*.primaries.store

GET /apache-logsdb/_stats?filter_path=indices.*.primaries.store
</code></pre>
<p>The <code>filter_path</code> parameter keeps the response focused. Look for <code>primaries.store.size_in_bytes</code> in each response.</p>
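<p>If you would rather see both indices side by side in one response, the <code>_cat/indices</code> API offers a compact alternative. This is an optional sketch; the column list is just one reasonable selection.</p>
<pre><code class="language-json"># One row per index: name, document count, primary store size
GET /_cat/indices/apache-standard,apache-logsdb?v&amp;h=index,docs.count,pri.store.size&amp;s=index
</code></pre>
<p>The <code>docs.count</code> column doubles as a sanity check that both indices hold the same number of documents before you compare sizes.</p>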
<p>In our test with Apache log records, the results were:</p>
<table>
<thead>
<tr>
<th>Index</th>
<th>Documents</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>apache-standard</td>
<td>111,818</td>
<td>15.37 MB</td>
</tr>
<tr>
<td>apache-logsdb</td>
<td>111,818</td>
<td>8.6 MB</td>
</tr>
<tr>
<td><strong>Reduction</strong></td>
<td></td>
<td><strong>44%</strong></td>
</tr>
</tbody>
</table>
<p>To put this in perspective: at 1 TB of log data, a 44% reduction brings that down to around 560 GB. That's about 440 GB saved without any changes to your queries. At production scale with billions of documents and synthetic <code>_source</code> enabled, savings push to 76%, taking 162.7 GB down to 39.4 GB in our benchmark.</p>
<h2>Visualize in Kibana</h2>
<p>To see the storage difference visually, open Kibana and go to <strong>Management → Stack Management → Index Management</strong>. You'll see both indices listed with their current sizes side by side.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elasticsearch-logsdb-index-mode-storage-savings/index-stats.png" alt="Kibana Index Management showing storage comparison between standard and LogsDB indices" /></p>
<blockquote>
<p><strong>Why Kibana shows larger numbers than <code>_stats</code>:</strong> Kibana Index Management displays the total index size including all replica shards. The <code>_stats</code> query above uses <code>primaries</code> to report primary shards only. The ratio between the two indices remains the same either way.</p>
</blockquote>
<h2>What about your existing logs?</h2>
<h3>Elasticsearch 9.2+ (already enabled by default)</h3>
<p>Since 9.2, any data stream matching the <code>logs-*</code> naming pattern automatically uses LogsDB. You're likely already saving storage without any configuration change.</p>
<p>Verify your existing data streams:</p>
<pre><code>GET /.ds-logs-*/_settings?filter_path=*.settings.index.mode
</code></pre>
<p>If the responses show <code>index.mode</code> set to <code>logsdb</code> (it appears as <code>&quot;mode&quot;: &quot;logsdb&quot;</code> nested under <code>index</code>), you're already getting the savings.</p>
<h3>Elasticsearch 8.x or 9.0–9.1 (enable per data stream via index template)</h3>
<p>For earlier versions, enable LogsDB on a data stream by updating its index template. This affects all new indices created from that template — existing indices are not changed, so the transition is safe and gradual.</p>
<p><strong>Option A (create or replace a template for your data stream):</strong></p>
<pre><code class="language-json">PUT _index_template/logs-myapp-template
{
  &quot;index_patterns&quot;: [&quot;logs-myapp-*&quot;],
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;logsdb&quot;
    }
  },
  &quot;priority&quot;: 200
}
</code></pre>
<p><strong>Option B (check and patch an existing integration template):</strong></p>
<p>First, find the template managing your data stream:</p>
<pre><code>GET _index_template/logs-apache*
</code></pre>
<p>Then add the <code>index.mode</code> setting to the <code>template.settings</code> block using a <code>PUT _index_template/&lt;name&gt;</code> call with the full template body including your addition.</p>
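<p>For integration-managed data streams, an alternative to rewriting the full template is a <code>@custom</code> component template. The sketch below assumes the index template you found lists <code>logs-apache.access@custom</code> under <code>composed_of</code>; confirm that in the <code>GET</code> output first. Note that the <code>PUT</code> replaces any existing contents of that component template.</p>
<pre><code class="language-json"># Assumes the integration template is composed of logs-apache.access@custom
PUT _component_template/logs-apache.access@custom
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;logsdb&quot;
    }
  }
}
</code></pre>
<p>Keeping the change in a <code>@custom</code> component template means the managed template itself stays untouched.</p>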
<p>After updating the template, the next index rollover will use LogsDB. Trigger a rollover immediately if you don't want to wait:</p>
<pre><code>POST /logs-myapp-default/_rollover
</code></pre>
<p><strong>Upgrading from 8.x to 9.0+:</strong> Existing data streams are not changed automatically. Only new rollovers will use LogsDB. There is no data loss and no reindexing required — the savings accumulate as new indices roll over.</p>
<h2>What about query performance?</h2>
<p>LogsDB does not significantly impact query performance for typical log analytics workloads. The index sorting by <code>host.name</code> and <code>@timestamp</code> can actually <em>improve</em> range query and aggregation performance on those fields, since matching documents are stored adjacently. Queries that don't filter on those fields perform comparably to a standard index.</p>
<p>For indexing throughput data across releases, see the <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-logsdb-storage-evolution#performance-not-just-storage">performance section</a> of the companion article.</p>
<h2>Conclusion</h2>
<p>LogsDB activates with a single <code>&quot;index.mode&quot;: &quot;logsdb&quot;</code> setting and delivers measurable storage savings immediately: 44% in our hands-on test, and 76% (162.7 GB → 39.4 GB) in production benchmarks with synthetic <code>_source</code>. On Elasticsearch 9.2+, <code>logs-*</code> data streams already use LogsDB by default. For 8.x or earlier 9.x clusters, a one-line index template change enables it on your next rollover with no data loss and no reindexing required.</p>
<h2>Next steps</h2>
<ul>
<li><a href="https://www.elastic.co/docs/reference/elasticsearch/index-settings/logsdb">LogsDB index mode documentation</a></li>
<li><a href="https://www.elastic.co/docs/reference/elasticsearch/mapping/synthetic-source">Synthetic <code>_source</code> documentation and limitations</a></li>
<li><a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/logs-data-stream">Configuring a logs data stream</a></li>
<li><a href="https://www.elastic.co/blog/logsdb-index-mode-generally-available">LogsDB GA announcement</a></li>
<li><a href="https://www.elastic.co/blog/elasticsearch-logsdb-tsds-benchmarks">LogsDB and TSDS performance benchmarks</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elasticsearch-logsdb-index-mode-storage-savings/header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Elasticsearch over the years — how LogsDB cuts index size by up to 75% at no throughput cost]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elasticsearch-logsdb-storage-evolution</link>
            <guid isPermaLink="false">elasticsearch-logsdb-storage-evolution</guid>
            <pubDate>Thu, 09 Apr 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[By default, Elasticsearch is optimized for retrieval, not storage. LogsDB changes that. Here's the layered architecture behind a 77% index size reduction.]]></description>
            <content:encoded><![CDATA[<p>Elasticsearch was built as a search engine. That heritage has a cost for log storage: every event fans out to multiple on-disk structures, each optimized for retrieval rather than compression. LogsDB changes that. On our nightly benchmark, Enterprise mode produces a 37.5 GB index from the same data that takes 161.9 GB without LogsDB, a 77% reduction from a single setting.</p>
<p><img src="/assets/images/elasticsearch-logsdb-storage-evolution/storage-breakdown-v3-bold@2x.png" alt="Standard vs LogsDB storage breakdown" /></p>
<h2>The write overhead</h2>
<p>Lucene, the library underneath, keeps multiple structures for every indexed document:</p>
<ul>
<li>The <strong>inverted index</strong> maps terms to documents. This is what makes text search fast.</li>
<li><strong><code>_source</code></strong> stores the original JSON blob, returned when you fetch a document.</li>
<li><strong>Doc values</strong> store field values in columns for sorting and aggregation.</li>
<li><strong>Points / BKD trees</strong> index numeric and date fields for range queries.</li>
</ul>
<p>The inverted index earns its keep: it's what lets you search a billion log lines by keyword in milliseconds, and there's no cheaper way to build that capability. <code>_source</code> exists to give you back exactly what you indexed: search results and <code>GET</code> requests return this blob directly. The problem is that it stores the full event even though the same field values are already available through doc values and the other structures.</p>
<p>Take a log event with fields like <code>host.name</code>, <code>@timestamp</code>, <code>http.response.status_code</code>, and <code>duration_ms</code>. The entire event is serialized as JSON in <code>_source</code>. The same field values are also written into doc values columns, indexed into the inverted index, and stored in BKD trees for range queries. Same data, multiple structures, each with its own on-disk footprint.</p>
<p>For a search engine where you need fast retrieval across all dimensions, that overhead is a reasonable tradeoff. For logs, where you rarely need the raw JSON and almost never do relevance-ranked search, much of it is pure waste.</p>
<p><img src="/assets/images/elasticsearch-logsdb-storage-evolution/dual-storage-bold@2x.png" alt="One incoming log event fans out to four on-disk structures" /></p>
<p><em>One write, four on-disk structures: <code>_source</code> (the raw JSON blob), the inverted index, doc values columns, and BKD / points trees for numeric range queries. The same field values end up in multiple places.</em></p>
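<p>To see how this fan-out plays out on one of your own indices, the analyze index disk usage API breaks storage down per field and per structure. This is a sketch: <code>my-logs-index</code> is a hypothetical index name, and the API is resource-intensive, so it is best run against a test index.</p>
<pre><code class="language-json"># Hypothetical index name; run_expensive_tasks must be set to true
POST /my-logs-index/_disk_usage?run_expensive_tasks=true
</code></pre>
<p>For each field, the response reports separate sizes for structures such as <code>inverted_index</code>, <code>stored_fields</code>, <code>doc_values</code>, and <code>points</code>.</p>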
<h2>Why columnar storage matters for compression</h2>
<p>Doc values are the key to everything LogsDB does. Unlike <code>_source</code>, which stores entire documents as blobs, doc values store each field as a separate column across all documents in a Lucene segment.</p>
<p>Picture a segment with a million log events. The <code>_source</code> representation is a million JSON blobs, one per event, each containing all fields jumbled together. The doc values representation is a set of columns: one column of a million timestamps, one column of a million host names, one column of a million status codes, and so on.</p>
<p><img src="/assets/images/elasticsearch-logsdb-storage-evolution/doc-values-columns-bold@2x.png" alt="Row-oriented vs column-oriented storage" /></p>
<p><em>Row-oriented <code>_source</code> keeps all fields for each document in one blob: doc0 through doc5 each carry <code>host.name</code>, <code>@timestamp</code>, <code>status</code>, <code>duration_ms</code>, and more jumbled together. Column-oriented doc values restructure the same data so all <code>host.name</code> values sit in one column, all timestamps in another, all status codes in another. Compression codecs can then run on each contiguous column independently.</em></p>
<p>That columnar layout is what makes per-column compression possible. When all values of <code>http.response.status_code</code> sit in a contiguous column, Lucene can apply codecs that exploit patterns in the sequence.</p>
<p>Delta encoding stores differences between adjacent values instead of full values. GCD encoding finds a common factor and divides everything down. Run-length encoding collapses repeats. Lucene picks the codec per segment and re-evaluates when segments merge.</p>
<p><img src="/assets/images/elasticsearch-logsdb-storage-evolution/numeric-codec-pipeline-bold@2x.png" alt="Numeric codec pipeline: RAW → DELTA → GCD → BIT-PACK" /></p>
<p><em>Four sorted <code>@timestamp</code> values from the same host, compressed in four stages. RAW: four 32-bit integers, 128 bits total. DELTA: store differences instead of full values; the base stays, and the deltas +100, +200, +300 take 59 bits. GCD: divide out the common factor of 100, leaving 1, 2, 3 at 39 bits. BIT-PACK: pack those three small integers into contiguous bit storage, 9 bits freed.</em></p>
<p>But here's the catch: these codecs only work well when adjacent documents have correlated values. Consider the <code>@timestamp</code> column.</p>
<p>If logs arrive from dozens of hosts interleaved randomly, the timestamps in the column jump around. The delta between adjacent values might be +3 seconds, then -47 seconds, then +120 seconds. Delta encoding can't do much with that.</p>
<p>Now consider what happens if you sort by <code>host.name</code> and <code>@timestamp</code> before writing to the segment. All logs from host-A land in a contiguous run, followed by all logs from host-B, and so on. Within each host's run, the timestamps are monotonically increasing and the deltas are predictable.</p>
<p>Four timestamps from the same host might look like 1706745600, +100s, +200s, +300s. Delta encoding shrinks those to a base value plus three small integers.</p>
<p>GCD encoding finds that 100, 200, 300 are all divisible by 100 and stores 1, 2, 3 instead. Bit-packing then fits those three values into a handful of bits. The same pattern applies to fields like <code>host.name</code>, <code>service.name</code>, or <code>http.response.status_code</code>: within a sorted run, long stretches of identical values collapse to near nothing under run-length encoding.</p>
&lt;div align=&quot;center&quot;&gt;
![Index sorting: arrival order → sorted by host.name → after RLE](/assets/images/elasticsearch-logsdb-storage-evolution/index-sorting-bold@2x.png)
_Five hosts — api-01, api-02, db-01, web-01, web-02 — scattered randomly in arrival order (left). Sorting by `host.name` groups them into five contiguous blocks of eight (center). Run-length encoding collapses each block to a single (value, count) pair — 5 pairs stored instead of 40, the remaining slots freed (right)._
&lt;/div&gt;
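<p>The effect of sorting on run-length encoding can be sketched in a few lines of Python. The round-robin interleaving below is a deterministic stand-in for the random arrival order in the figure, and <code>run_length_encode</code> is a hypothetical helper, not an Elasticsearch API.</p>

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive identical values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

hosts = ["api-01", "api-02", "db-01", "web-01", "web-02"]

# 40 documents arriving interleaved: no two neighbours share a host.
arrival_order = hosts * 8
# The same 40 documents after sorting by host.name.
sorted_order = sorted(arrival_order)

arrival_runs = run_length_encode(arrival_order)
sorted_runs = run_length_encode(sorted_order)
```

<p>In arrival order every document is its own run, so RLE saves nothing. After sorting, the same column collapses to five pairs.</p>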
<p>Elasticsearch never sorted by default. Documents landed in arrival order, compressed with DEFLATE. We left a lot on the table.</p>
<h2>How we got here: 2012–2026</h2>
<p>Not all of the individual techniques in LogsDB were designed for logs. They were built over twelve years to solve different problems, and LogsDB is what happens when you stack them.</p>
<p><strong>The foundation (2012–2017).</strong> Lucene 4.0 introduced doc values in 2012. By Elasticsearch 5.0 in 2016, they were on by default for all keyword and numeric fields. Lucene 7.0 added sparse doc values, so fields that only appear in some documents don't waste space on every document in the segment. That fixed a significant force-merge bloat problem (up to 10× on sparse fields) and set up the storage model everything else depends on.</p>
&lt;div align=&quot;center&quot;&gt;
![Dense vs sparse doc values encoding](/assets/images/elasticsearch-logsdb-storage-evolution/sparse-doc-values-bold@2x.png)
_Dense encoding reserves an 8-byte slot per document regardless of presence. Sparse encoding stores only documents that have a value at 12 bytes each (value + doc ID). For `error_code` with 2 of 16 docs populated (12% fill), sparse is 81% smaller: 24 B vs 128 B. For `request_path` at 88% fill, sparse is larger: 168 B vs 128 B. Lucene picks per field; sparse wins below ~67% fill._
&lt;/div&gt;
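<p>The break-even point in the figure follows from simple arithmetic. The sketch below uses the per-entry byte costs from the caption (8 bytes per dense slot, 12 bytes per sparse entry); these are the figure's simplified model, not measurements of Lucene's on-disk format, and the function names are hypothetical.</p>

```python
DENSE_SLOT_BYTES = 8     # one slot per document, whether or not it has a value
SPARSE_ENTRY_BYTES = 12  # value + doc ID, only for documents that have a value

def dense_bytes(total_docs):
    return total_docs * DENSE_SLOT_BYTES

def sparse_bytes(populated_docs):
    return populated_docs * SPARSE_ENTRY_BYTES

def sparse_is_smaller(populated_docs, total_docs):
    return sparse_bytes(populated_docs) < dense_bytes(total_docs)

# Fill ratio at which both encodings cost the same: 8 / 12, roughly 67%.
break_even_fill = DENSE_SLOT_BYTES / SPARSE_ENTRY_BYTES

error_code = (sparse_bytes(2), dense_bytes(16))      # 2 of 16 docs populated
request_path = (sparse_bytes(14), dense_bytes(16))   # 14 of 16 docs populated
```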
<p><strong>Incremental wins (2020–2021).</strong> Two smaller changes targeted observability workloads. Dictionary-based stored fields compression deduplicated repetitive string metadata for about a 10% win.</p>
<p>The <code>match_only_text</code> field type dropped term frequencies and positions from the inverted index. Term frequencies are what BM25 uses to score documents by relevance — how often a term appears in a document relative to the rest of the corpus. For log search that signal is meaningless: you don't care whether &quot;timeout&quot; appeared twice or seven times in a log line, you just want to find it. Positions are similar: they're stored so Elasticsearch can do exact phrase matching, but the position data is expensive and phrase queries on logs are rare enough that the tradeoff is worth it. When you do run a phrase query on a <code>match_only_text</code> field, it still works — it just falls back to a slower path that rescores candidates rather than using stored positions directly.</p>
&lt;div align=&quot;center&quot;&gt;
![text vs match_only_text inverted index storage](/assets/images/elasticsearch-logsdb-storage-evolution/match-only-text-bold@2x.png)
_`text` stores each term with its frequency and every position it appears at. `match_only_text` keeps only the doc IDs — enough to find the document, nothing more. The `timeout` term appears twice in this message (positions 1 and 4), which is exactly the kind of data that gets dropped._
&lt;/div&gt;
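<p>As a toy model of the difference, the Python sketch below builds both kinds of postings for two log lines and then answers a phrase query using only doc IDs. The whitespace tokenizer, the function names, and the verify-against-the-text fallback are assumptions made for illustration; Elasticsearch's actual analysis and phrase handling are more involved.</p>

```python
from collections import defaultdict

def tokenize(message):
    return message.lower().split()

def index_text(docs):
    """text-style postings: term -> {doc_id: [positions]}; frequency is len(positions)."""
    postings = defaultdict(dict)
    for doc_id, message in docs.items():
        for position, term in enumerate(tokenize(message)):
            postings[term].setdefault(doc_id, []).append(position)
    return dict(postings)

def index_match_only(docs):
    """match_only_text-style postings: term -> sorted doc IDs, nothing else."""
    postings = defaultdict(set)
    for doc_id, message in docs.items():
        for term in set(tokenize(message)):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def phrase_match(docs, doc_only, phrase):
    """Without positions: intersect doc IDs, then check each candidate's text."""
    terms = tokenize(phrase)
    candidates = set(docs)
    for term in terms:
        candidates &= set(doc_only.get(term, []))

    def has_phrase(tokens):
        width = len(terms)
        return any(tokens[i:i + width] == terms for i in range(len(tokens) - width + 1))

    return sorted(d for d in candidates if has_phrase(tokenize(docs[d])))

docs = {
    1: "connection timeout after retry timeout",
    2: "retry succeeded after timeout",
}
text_index = index_text(docs)
doc_only_index = index_match_only(docs)
phrase_hits = phrase_match(docs, doc_only_index, "after retry")
```

<p>Both documents contain "after" and "retry", so both are candidates, but only document 1 contains them as a phrase. The extra verification step is the slower path.</p>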
<p>Dropping frequencies and positions cuts the inverted index for a text field by roughly 40%. The overall index impact in 2021 was only ~10%, which sounds like a poor return on a 40% field-level reduction. The reason is where storage was going at the time: <code>_source</code> was stored in full for every document as a raw JSON blob, doc values were uncompressed and unsorted, and nothing was using ZSTD. The <code>message</code> field's inverted index was a small slice of a much larger, poorly-compressed whole. As the next five years of work addressed those other structures, the same 40% field-level savings became a meaningful fraction of a much smaller total.</p>
<p>Neither change was decisive on its own, but they established that log-specific storage optimization was worth pursuing.</p>
<p><strong>The TSDB turning point (April 2023).</strong> This is where the story really starts. We shipped synthetic <code>_source</code> and index sorting for time series metrics in Elasticsearch 8.7.</p>
<p>Synthetic source changes the write-and-read contract. At write time, we skip storing the raw JSON blob entirely. At read time, when a query needs to return the original document, we reconstruct it by reading each field's value out of doc values and stored fields and assembling them back into JSON. The result is functionally equivalent to the original <code>_source</code> (with minor differences like field ordering), but we never stored the blob.</p>
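<p>The contract can be sketched with a toy column store in Python. This is a simplified model that assumes flat documents with scalar values. <code>ColumnStore</code> and its methods are hypothetical names, not Elasticsearch internals.</p>

```python
import json

class ColumnStore:
    """Keeps one column per field and never retains the raw JSON."""

    def __init__(self):
        self.columns = {}  # field name -> {doc_id: value}

    def index(self, doc_id, raw_json):
        for field, value in json.loads(raw_json).items():
            self.columns.setdefault(field, {})[doc_id] = value
        # raw_json is deliberately dropped here.

    def synthetic_source(self, doc_id):
        # Reassemble the document from the columns; fields come back in
        # column order (alphabetical here), not in the original order.
        doc = {
            field: column[doc_id]
            for field, column in sorted(self.columns.items())
            if doc_id in column
        }
        return json.dumps(doc)

raw = '{"message": "ok", "host.name": "api-01", "@timestamp": 1706745600}'
store = ColumnStore()
store.index(1, raw)
rebuilt = store.synthetic_source(1)
```

<p>The rebuilt document parses to the same fields and values as the original, but the serialized string differs because the field order changed.</p>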
<p>Index sorting groups documents by dimension fields and timestamp before writing to disk. Together, synthetic source and index sorting cut metrics storage by up to 70%.</p>
<p>That result told us something important: the same architecture could work for logs.</p>
&lt;div align=&quot;center&quot;&gt;
![Standard _source vs synthetic _source](/assets/images/elasticsearch-logsdb-storage-evolution/synthetic-source-bold@2x.png)
_Without LogsDB, Elasticsearch writes every log event twice: once as a raw `_source` blob on disk, once into doc values columns. LogsDB skips the blob entirely. At read time, a `GET &lt;index&gt;/_doc/1` request gathers field values from doc values and assembles the document on the fly._
&lt;/div&gt;
<p><strong>The TSDB codec (2024).</strong> In 8.13 and 8.14, we built a custom doc values codec with run-length encoding optimized for sorted consecutive values, PFOR-delta encoding, and cyclic ordinal encoding for multi-valued dimensions. The numbers were striking: <code>kubernetes.pod.name</code> doc values dropped from 110 MB to 7.25 MB in one benchmark. We extended coverage to all numeric and keyword types including <code>ip</code>, <code>scaled_float</code>, and <code>unsigned_long</code>.</p>
<p><strong>LogsDB Tech Preview (August 2024).</strong> In <a href="https://github.com/elastic/elasticsearch/pull/108896">8.15</a>, we combined everything into <code>index.mode: logsdb</code>: host-first sorting, synthetic <code>_source</code>, ZSTD compression, and the TSDB numeric codecs. One decision mattered more than expected: sort order. Sorting by <code>host.name</code> first, then <code>@timestamp</code>, delivers up to ~40% storage reduction. Sorting by timestamp first gives ≤10%. The host-first ordering co-locates documents that share field values, which is exactly what the numeric codecs need.</p>
<p><strong>ZSTD and GA (November–December 2024).</strong> In <a href="https://github.com/elastic/elasticsearch/pull/112665">8.16</a>, we switched <code>best_compression</code> from DEFLATE to ZSTD permanently (level 3, blocks up to 2,048 documents or 240 kB, native bindings via Panama FFI on JDK 21+). ZSTD gave us ~12% smaller stored fields and ~14% higher indexing throughput at the same time, which almost never happens. LogsDB went GA in 8.17.</p>
<p>At GA, we claimed up to 65% storage reduction.</p>
<p><strong>Routing and recovery (April 2025).</strong> In 8.18, <a href="https://github.com/elastic/elasticsearch/pull/116687"><code>route_on_sort_fields</code></a> started routing documents to shards by sort field values instead of <code>_id</code>. Without this optimization, Elasticsearch hashes the <code>_id</code> to pick a shard, so logs from the same host scatter across all shards. With routing on sort fields, logs with similar <code>host.name</code> values land on the same shard. This co-locates similar documents at the shard level, not just within segments, adding ~20% storage reduction at a 1–4% ingest penalty. Routing on sort fields requires auto-generated <code>_id</code>.</p>
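<p>A rough Python sketch of the two routing strategies follows. It uses <code>zlib.crc32</code> as a stand-in hash and routes on <code>host.name</code> alone. The real hash function and the exact set of fields Elasticsearch hashes are not modeled here, and all names are illustrative.</p>

```python
import zlib

NUM_SHARDS = 3

def shard_for(routing_value):
    # crc32 is only a deterministic stand-in for the real routing hash.
    return zlib.crc32(routing_value.encode("utf-8")) % NUM_SHARDS

# 60 documents from six hosts, each with its own _id.
docs = [{"_id": f"doc-{i}", "host.name": f"host-{i % 6}"} for i in range(60)]

def shards_per_host(docs, routing_field):
    """Map each host to the set of shards its documents land on."""
    placement = {}
    for doc in docs:
        shard = shard_for(doc[routing_field])
        placement.setdefault(doc["host.name"], set()).add(shard)
    return placement

by_id = shards_per_host(docs, "_id")          # standard routing
by_host = shards_per_host(docs, "host.name")  # routing on a sort field
```

<p>With routing on the sort field, every host maps to exactly one shard. With <code>_id</code> routing, the same host's documents are generally spread over several shards.</p>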
&lt;div align=&quot;center&quot;&gt;
![Shard routing: standard, routed, routed + sorted](/assets/images/elasticsearch-logsdb-storage-evolution/shard-routing-bold@2x.png)
_Data stream `.ds-logs-nginx-default-00001` with six hosts across three shards. STANDARD (hashed by `_id`): all host colors scattered randomly. ROUTED (`route_on_sort_fields`): same-host logs land on the same shard, but remain in arrival order within it. ROUTED + SORTED (host-first sort): each shard contains contiguous blocks of a single host — the combination that lets numeric codecs and RLE reach their full potential._
&lt;/div&gt;
<p>We also <a href="https://github.com/elastic/elasticsearch/pull/119110">switched peer recovery to synthetic source reconstruction</a>, eliminating the duplicate <code>_recovery_source</code> blob. In <a href="https://github.com/elastic/elasticsearch/pull/121049">9.0</a>, <code>logs-*-*</code> indices default to LogsDB.</p>
&lt;div align=&quot;center&quot;&gt;
![Index size written: _recovery_source eliminated](/assets/images/elasticsearch-logsdb-storage-evolution/recovery-source-bold@2x.png)
_Nightly synthetic source benchmark, December 2024. Index size written drops 39% — from ~279 GB to ~171 GB — the day peer recovery switches from copying the raw `_recovery_source` blob to reconstructing documents from doc values._
&lt;/div&gt;
<p><strong>Merge and recovery overhaul: 9.1 (July 2025).</strong> We fully eliminated the recovery source. Peer recovery uses batched synthetic reconstruction, cutting write I/O by ~50% and boosting median indexing throughput ~19% over the 8.17 baseline. We replaced up to four separate doc values merge passes with a single pass, cutting background merge CPU by up to 40%. And we swapped <code>_seq_no</code>'s BKD tree for Lucene doc value skippers, halving <code>_seq_no</code> storage.</p>
<p><strong>pattern_text and Failure Store: 9.2–9.3 (October 2025–February 2026).</strong> In <a href="https://github.com/elastic/elasticsearch/pull/124323">9.2</a>, we shipped <code>pattern_text</code> as a Tech Preview: a new field type that decomposes log messages into static templates and dynamic variable parts. A log line like <code>Session opened for user alice from 10.0.1.42 via TLS</code> gets split into the template <code>Session opened for user {} from {} via TLS</code> (stored once, as a template ID) and the variables <code>alice</code>, <code>10.0.1.42</code> (stored per document). For logs with high template repetition, this cuts message field storage by up to 50%. A companion <code>template_id</code> sub-field lets you sort by template, and the LogsDB setting <code>index.logsdb.default_sort_on_message_template</code> enables this automatically. <code>pattern_text</code> <a href="https://github.com/elastic/elasticsearch/pull/135370">went GA in 9.3</a>.</p>
&lt;div align=&quot;center&quot;&gt;
![TEXT vs PATTERN_TEXT field type](/assets/images/elasticsearch-logsdb-storage-evolution/pattern-text-bold@2x.png)
_TEXT stores each log message as a full string per document — eight copies of near-identical blobs. PATTERN_TEXT decomposes them: the shared template `Session opened for user {} from {} via TLS` is stored once with ID T0, and only the variable columns (`user`, `ip`) are stored per document — alice/10.0.1.42, bob/10.0.1.87, carol/10.0.2.11, and so on._
&lt;/div&gt;
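<p>The stored shape can be illustrated with a small Python sketch. It derives the template by comparing a batch of messages token by token. That is an assumption chosen to keep the example short and is not how <code>pattern_text</code> identifies variables at index time. The function names are hypothetical.</p>

```python
def extract_template(messages):
    """Split same-shaped messages into one shared template and per-document variables."""
    rows = [message.split() for message in messages]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("this sketch only handles messages with the same token count")

    template_tokens, variable_positions = [], []
    for position, column in enumerate(zip(*rows)):
        if len(set(column)) == 1:
            template_tokens.append(column[0])    # static text, stored once
        else:
            template_tokens.append("{}")         # placeholder for a variable
            variable_positions.append(position)

    template = " ".join(template_tokens)
    variables = [[row[p] for p in variable_positions] for row in rows]
    return template, variables

def rebuild(template, values):
    """Reassemble one message from the template and its variables."""
    return template.format(*values)

messages = [
    "Session opened for user alice from 10.0.1.42 via TLS",
    "Session opened for user bob from 10.0.1.87 via TLS",
    "Session opened for user carol from 10.0.2.11 via TLS",
]
template, variables = extract_template(messages)
```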
<p><code>pattern_text</code> does come with an indexing CPU cost: decomposing each message into template and variables takes more work at write time than storing a raw string. Whether that tradeoff makes sense depends on your dataset and your priorities.</p>
<p>If your log messages follow highly repetitive patterns (structured application logs, Kubernetes events, access logs), the storage wins are large and the CPU overhead is bounded. If your messages are free-form or low-repetition, the compression gains shrink while the CPU cost stays roughly the same.</p>
<p>For data you keep for months or years, the cumulative storage reduction usually makes it worthwhile. For high-cardinality, rapidly changing messages where storage isn't the constraint, it may not be.</p>
<p>9.3 also brought compression for binary doc values, making <code>wildcard</code> field types significantly more storage-efficient. Internally, wildcard fields store an inverted index of trigrams in a binary doc values column; that column is now compressed with Zstandard instead of being stored raw. In one benchmark, a URL field dropped from 2.92 GB to 1.12 GB, a reduction of more than 60%. If you use <code>wildcard</code> fields heavily, the gain is automatic with no mapping changes needed.</p>
<p>Also in 9.3, skip lists for <code>@timestamp</code> and <code>host.name</code> became available as an opt-in for LogsDB. Skip lists let Elasticsearch jump ahead in a doc values column without reading every entry, which speeds up time-range queries on large segments. Other index modes have skip lists disabled by default; in LogsDB you can enable them selectively for the fields you range-query most.</p>
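<p>The idea behind skipping can be sketched in Python as per-block minimum and maximum values that let a range query ignore whole blocks. This is a simplified single-level model with a made-up block size and made-up function names. It is not Lucene's data structure.</p>

```python
def build_skipper(values, block_size):
    """Summarize a column as (min, max, start, end) for each block of entries."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append((min(chunk), max(chunk), start, start + len(chunk)))
    return blocks

def range_query(values, blocks, low, high):
    """Return matching entry indexes and how many blocks had to be read."""
    hits, blocks_read = [], 0
    for block_min, block_max, start, end in blocks:
        if block_max < low or block_min > high:
            continue  # the whole block is outside the range: skip it
        blocks_read += 1
        hits.extend(i for i in range(start, end) if low <= values[i] <= high)
    return hits, blocks_read

base = 1706745600
timestamps = [base + 10 * i for i in range(1000)]   # one entry every 10 seconds
blocks = build_skipper(timestamps, block_size=100)
hits, blocks_read = range_query(timestamps, blocks, base + 2500, base + 2590)
```

<p>Here the query touches one block out of ten. The benefit depends on the column being sorted, which is why it pairs well with LogsDB's sort order.</p>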
<p>Also in 9.3, the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store">Failure Store</a> <a href="https://github.com/elastic/elasticsearch/pull/131261">became enabled by default</a> for <code>logs-*-*</code> data streams. Failed documents (mapping conflicts, ingest pipeline errors) now land in dedicated <code>::failures</code> indices instead of being rejected, which means LogsDB's strict synthetic source requirements are less likely to cause silent data loss during migration.</p>
<h2>Performance, not just storage</h2>
<p>LogsDB started as a storage optimization, and the early releases came with a throughput cost — sorting, synthetic source reconstruction, and ZSTD all add work at write time. Over two years of releases, we clawed that back. Indexing throughput is now on par with what users had before enabling LogsDB. You get the storage reduction without giving up the ingest rate you were used to.</p>
&lt;div align=&quot;center&quot;&gt;
![LogsDB throughput and storage on disk over time](/assets/images/elasticsearch-logsdb-storage-evolution/performance-over-time-bold@2x.png)
_Throughput (teal) has climbed from ~25k to ~35k docs/s since the Tech Preview. Storage on disk (blue) has dropped from ~65 GB to ~36 GB on the same benchmark dataset. Both curves move in the right direction, driven by the same layered releases: ZSTD in 8.16, routing optimization in 8.18, the merge and recovery overhaul in 9.1. Live numbers at [elasticsearch-benchmarks.elastic.co](https://elasticsearch-benchmarks.elastic.co/#tracks/logsdb/nightly/default/90d)._
&lt;/div&gt;
<p>The two trends compound each other. Less storage means fewer segments to merge, which frees CPU for indexing. Reconstructing <code>_source</code> on demand costs less than storing and replicating the raw blob. Each release that shrank the index also reduced background I/O, which fed back into throughput.</p>
<p>The practical result: if you were running standard Elasticsearch for log ingestion two years ago, the throughput you had then is roughly what LogsDB delivers now — with a 50–75% smaller index alongside it.</p>
<h2>How to enable it</h2>
<p>As of 9.0, <code>logs-*-*</code> data streams default to LogsDB automatically. If your data streams match that pattern, you're already using it.</p>
<blockquote>
<p><strong>Want a hands-on walkthrough?</strong> <a href="https://www.elastic.co/observability-labs/blog/elasticsearch-logsdb-index-mode-storage-savings"><em>Cut Elasticsearch log storage costs by 76% with LogsDB</em></a> walks through creating two indices, reindexing, and measuring the difference with the <code>_stats</code> API — including version-specific enable instructions for 8.x clusters.</p>
</blockquote>
<p>For other index patterns, set it in your template:</p>
<pre><code class="language-json">PUT _index_template/logs-template
{
  &quot;index_patterns&quot;: [&quot;logs-*&quot;],
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;logsdb&quot;
    }
  }
}
</code></pre>
<p>Synthetic <code>_source</code> turns on automatically with <code>index.mode: logsdb</code>.</p>
<p>For the routing optimization (8.18+), add one more setting:</p>
<pre><code class="language-json">PUT _index_template/logs-template
{
  &quot;index_patterns&quot;: [&quot;logs-*&quot;],
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.mode&quot;: &quot;logsdb&quot;,
      &quot;index.logsdb.route_on_sort_fields&quot;: true
    }
  }
}
</code></pre>
<p>This routes documents to shards by sort field values instead of <code>_id</code>, adding ~20% storage reduction at a 1–4% ingestion penalty. It requires at least two sort fields in addition to <code>@timestamp</code>, and the <code>_id</code> must be auto-generated.</p>
<p>Switching an existing index to LogsDB requires a reindex. So does rolling back. There's no in-place conversion, so try it on new data streams first.</p>
<p>Storage improves further as segments merge — freshly written data compresses well, but merged segments compress even better.</p>
<h2>What's next</h2>
<p>Elasticsearch still carries some structural overhead from its search engine roots. <code>_id</code> and <code>_seq_no</code> are two examples: both consume meaningful disk space (on small documents they can account for more than half the index size), but neither is essential for log analytics workloads.</p>
<p>We've already taken the first step for TSDB: <a href="https://github.com/elastic/elasticsearch/pull/144026">PR #144026</a> eliminated stored <code>_id</code> bytes from TSDB indices by reconstructing the field on the fly from doc values, the same approach synthetic <code>_source</code> uses. We're exploring the same direction for LogsDB.</p>
<p><strong>9.4 and beyond.</strong> The architecture still has room to improve, and we're on it.</p>
<p>For the full reference, see the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/logs-data-stream.html">logs data stream documentation</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elasticsearch-logsdb-storage-evolution/elasticsearch-logsdb-storage-evolution.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Reconciliation in Elastic Streams: A Robust Architecture Deep Dive]]></title>
            <link>https://www.elastic.co/observability-labs/blog/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation</link>
            <guid isPermaLink="false">from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation</guid>
            <pubDate>Tue, 04 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Elastic's engineering team refactored Streams using a reconciliation model inspired by Kubernetes & React to build a robust, extensible, and debuggable system.]]></description>
            <content:encoded><![CDATA[<p>Streams is a new, unified approach to data management in the Elastic Stack. It wraps a set of existing Elasticsearch building blocks—data streams, index templates, ingest pipelines, retention policies—into a single, coherent primitive: the Stream. Instead of configuring these parts individually and in the right order, users can now rely on Streams to orchestrate them safely and automatically. With a unified UI in Kibana and a simplified API, Streams reduces cognitive load, lowers the risk of misconfiguration, and supports more flexible workflows like late binding—where users can ingest data first and decide how to process and route it later.</p>
<p>But behind that clean user experience lies a fast-moving, evolving codebase. In this post, we’ll explore how we rethought its architecture to keep up with product demands—while laying the groundwork for future flexibility and scale.</p>
<p>Rapid experimentation often leads to messy code—but before shipping to customers, we have to ask: If this succeeds, can we continue evolving it?
That question puts code health front and center. To move fast in the long term, we need a foundation that supports iteration.</p>
<p>When I joined the Streams team about six months ago, the project was moving fast through uncharted territory amid high uncertainty. This combination of speed and uncertainty created the perfect conditions for, well, spaghetti code—crafted by some of our most senior engineers, doing their best with a recipe missing a few ingredients.</p>
<p>The code was pragmatic and effective: it did exactly what it needed to do. But it was becoming increasingly difficult to understand and extend. Related logic was scattered across many files, with little separation of concerns, making it difficult to safely identify where and how to introduce changes. And the project still had a long road ahead.</p>
<p>Recently, we undertook a refactor of the underlying architecture—not just to bring greater clarity and structure to the codebase, but to establish clear phases that make it easier to debug and evolve. Our primary goal was to build a foundation that would let us continue moving quickly and confidently.
As a secondary goal, we aimed to enable new capabilities like bulk updates, dry runs, and system diagnostics.</p>
<p>In this post, we’ll briefly explore the challenges that prompted a new approach, share the architectural patterns that inspired us, explain how the new design works under the hood, and highlight what it enables for the future.</p>
<h2>The Challenges We Faced</h2>
<p>Streams aims to be a declarative model for data management. Users describe how data should flow: where it should go, what processing should happen along the way, and which mappings should apply. Behind the scenes, each API request results in one or more Elasticsearch resources being changed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/mess.png" alt="An image evoking a tangled mess" /></p>
<p>Before the refactor, the underlying code was increasingly difficult to reason about. There was no clear lifecycle that each request followed. Data was loaded only when it happened to be needed, validation was scattered across different functions, and cascading changes—like child streams reacting to parent updates—were applied recursively and implicitly. Elasticsearch requests could happen at any point during a request.</p>
<p>This led to several key challenges:</p>
<ul>
<li>
<p><strong>No clear place for validation</strong><br />
Without a single, centralized validation step, engineers weren’t sure where to add new checks—or whether existing ones would even run reliably. Some validations happened early, others late.</p>
</li>
<li>
<p><strong>No clear picture of the overall system state</strong><br />
Because there was no way to manage the system state as a whole, it was hard to reason about or validate. We couldn’t easily check whether a change was valid in the context of all other existing streams or dependencies.</p>
</li>
<li>
<p><strong>Unpredictable side effects</strong><br />
Since Elasticsearch operations could occur at different points in the flow, failures were harder to handle or roll back. We didn’t have a clear “commit point” where the changes were executed.</p>
</li>
<li>
<p><strong>Tangled stream logic</strong><br />
Logic for different types of streams was mixed together in shared code paths, often guarded by conditionals. This made it hard to isolate behavior, test individual types, or add new ones without risking unintended consequences.</p>
</li>
</ul>
<p>These challenges made it clear: we needed a more structured foundation, one capable of supporting both the current complexity and future growth.</p>
<h2>What We Needed to Move Forward</h2>
<p>To move faster yet with confidence, we needed a foundation that could evolve gracefully, make behavior easier to reason about, and reduce the likelihood of unexpected side effects.</p>
<p>We aligned around a few key goals:</p>
<ul>
<li>
<p><strong>A clear request lifecycle</strong><br />
Each request should move through clear, well-defined phases: loading the current state, applying changes, validating the resulting state, determining the Elasticsearch actions, and executing the actions. This structure would help engineers understand where things happen—and why.</p>
</li>
<li>
<p><strong>A unified state model</strong><br />
We wanted a clear model of desired vs. current state—a single place to reason about the outcome of a change. This would enable safer validation, more efficient updates, and easier debugging by allowing us to compute the difference between the two states.</p>
</li>
<li>
<p><strong>A single commit point</strong><br />
All Elasticsearch changes should happen in one place, after everything’s validated and we know exactly what needs to change. This would reduce side effects, make failures easier to manage, and unlock support for dry runs.</p>
</li>
<li>
<p><strong>Isolated stream logic</strong><br />
We needed clearer separation between stream types so each could be developed and tested in isolation. This would simplify adding new types, reduce unintended side effects, and clarify whether changes belong to a stream type or the state management layer.</p>
</li>
<li>
<p><strong>Bulk operations and system introspection</strong><br />
Finally, we wanted to support features like bulk updates, dry runs, and health diagnostics—capabilities that were difficult or impossible with the old design. A more explicit and inspectable model of system state would make this possible.</p>
</li>
</ul>
<p>These goals became our north star as we explored new architectural patterns to get there, with a strong focus on comparing the current state with the desired state.</p>
<h2>Where We Drew Inspiration From</h2>
<p>Our new design drew inspiration from two well-known open source projects: <a href="https://kubernetes.io/">Kubernetes</a> and <a href="https://react.dev/">React</a>. Though very different, both share a central concept: reconciliation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/reconciliation.png" alt="An image showing a flow chart for reconciliation" /></p>
<p>Reconciliation means comparing two states, calculating their differences, and taking the necessary actions to move the system from its current state to its desired state.</p>
<ul>
<li>
<p>In <a href="https://kubernetes.io/docs/concepts/architecture/controller/">Kubernetes</a>, you declare the desired state of your resources, and the controller continuously works to align the cluster with that state.</p>
</li>
<li>
<p>In <a href="https://legacy.reactjs.org/docs/faq-internals.html">React</a>, each component defines how it should render, and the virtual DOM updates the real DOM efficiently to match that.</p>
</li>
</ul>
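<p>At its core, reconciliation is a diff between two states that yields a list of actions. The Python sketch below shows that idea on plain dictionaries. The stream names, fields, and action tuples are invented for illustration and do not reflect the Streams codebase.</p>

```python
def reconcile(current, desired):
    """Compare two states and return the actions that turn current into desired."""
    actions = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    for name in sorted(current.keys() & desired.keys()):
        if current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions

current_state = {
    "logs": {"retention": "7d"},
    "logs.nginx": {"retention": "7d"},
}
desired_state = {
    "logs": {"retention": "30d"},
    "logs.apache": {"retention": "7d"},
}
actions = reconcile(current_state, desired_state)
```

<p>Because the actions are computed before anything is executed, they can be inspected, validated, or returned as-is without touching the system.</p>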
<p>We were also inspired by the <a href="https://mmapped.blog/posts/29-plan-execute">Plan/Execute</a> pattern, which aims to separate decision making from execution. This matched our need to perform all validations before committing to any actions, so that we could reason about and inspect the system's intent ahead of time.</p>
<p>These concepts resonated with what we needed. They made clear that we required two key pieces:</p>
<ol>
<li>
<p>A model representing system state, responsible for comparing states and driving the overall workflow (like the Kubernetes controller loop).</p>
</li>
<li>
<p>A representation of individual streams that make up that state, handling the specific logic for each stream type (like React components).</p>
</li>
</ol>
<p>Each Stream is defined and stored in Elasticsearch. We recognized a disconnect between data management and state changes in our existing code, so we designed each stream to manage both. This fits naturally with the <a href="https://www.martinfowler.com/eaaCatalog/activeRecord.html">Active Record pattern</a>, where a class encapsulates both domain logic and persistence.</p>
<p>To make the system easier to extend and the state model’s interface simpler, we implemented an abstract Active Record class using the <a href="https://refactoring.guru/design-patterns/template-method">Template Method pattern</a>, clearly defining the interface new stream types must follow.</p>
<p>We did have some concerns that adopting these more advanced patterns (reconciliation, Active Record, and Template Method) might make it harder for new or less experienced engineers to get up to speed. While the code would be cleaner and more straightforward for those familiar with the patterns, we worried it could create a barrier for juniors or newcomers unfamiliar with these concepts.</p>
<p>In practice, however, we found the opposite: the code became easier to follow because the patterns provided a clear, consistent structure. More importantly, the architectural choices helped keep the focus on the domain itself, rather than on complex implementation details, making it more approachable for the whole team. The patterns are there, but the code doesn't talk about them; it talks about the domain.</p>
<h2>How We Structured the System</h2>
<p>When a request hits one of our API endpoints in Kibana, the handler performs basic request validation, then passes the request to the Streams Client. The client’s job is to translate the request into one or more Change objects. Each Change represents the creation, modification, or deletion of a Stream.</p>
<p>These Change objects are then passed to a central class we introduced called <code>State</code>, which plays two key roles:</p>
<ul>
<li>
<p>It holds the set of Stream instances that make up the current version of the system.</p>
</li>
<li>
<p>It orchestrates the pipeline that applies changes and transitions from one state to another.</p>
</li>
</ul>
<p>Let’s walk through the key phases the State class manages when applying a change.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/flow.png" alt="Flowchart of the phases" /></p>
<h3>Loading the Starting State</h3>
<p>First, the State class loads the current system state by reading the stored Stream definitions from Elasticsearch. This becomes our reference point for all subsequent comparisons—used during validation, diffing, and action planning.</p>
<h3>Applying Changes</h3>
<p>We begin by cloning the starting state. Each Stream is responsible for cloning itself.
Then we process each incoming Change:</p>
<ul>
<li>
<p>The change is presented to all Streams in the current state (creating a new one if needed).</p>
</li>
<li>
<p>Each Stream can react by updating itself and optionally emitting cascading changes—additional changes that ripple through related Streams.</p>
</li>
<li>
<p>Cascading changes are processed in a loop until no more are generated (or until we hit a safety threshold).</p>
</li>
</ul>
<p>We then move to the next requested Change.<br />
If any requested or cascading Change cannot be applied safely, the system aborts the entire request to prevent partial updates.</p>
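<p>The loop described above might look roughly like the following Python sketch. The <code>Stream</code> class, the shape of a change, the inheritance rule, and the threshold value are all assumptions for illustration. Creating new Streams on demand is not modeled.</p>

```python
import copy
from collections import deque

MAX_CASCADES = 100  # hypothetical safety threshold

class Stream:
    def __init__(self, name, retention, parent=None):
        self.name, self.retention, self.parent = name, retention, parent

    def clone(self):
        return copy.copy(self)

    def apply_change(self, change):
        """React to a change and return any cascading changes it triggers."""
        if change["target"] == self.name:
            self.retention = change["retention"]
            return []
        if change["target"] == self.parent:
            # A child reacts to its parent's update by changing itself too.
            return [{"target": self.name, "retention": change["retention"]}]
        return []

def apply_changes(starting_state, requested_changes):
    # Work on clones so the starting state stays available for diffing.
    desired = {name: stream.clone() for name, stream in starting_state.items()}
    for requested in requested_changes:
        queue, processed = deque([requested]), 0
        while queue:
            processed += 1
            if processed > MAX_CASCADES:
                raise RuntimeError("too many cascading changes; aborting the request")
            change = queue.popleft()
            for stream in list(desired.values()):
                queue.extend(stream.apply_change(change))
    return desired

starting = {
    "logs": Stream("logs", "7d"),
    "logs.nginx": Stream("logs.nginx", "7d", parent="logs"),
    "logs.nginx.errors": Stream("logs.nginx.errors", "7d", parent="logs.nginx"),
}
desired = apply_changes(starting, [{"target": "logs", "retention": "30d"}])
```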
<h3>Validating the Desired State</h3>
<p>Once we’ve applied all Changes and cascading effects, we run validations to ensure the resulting configuration is safe and consistent.</p>
<p>Each Stream is asked to validate itself in the context of the full desired state and the original starting state. This allows for both localized checks (within a Stream) and broader coordination (between related Streams). If any validation fails, we abort the request.</p>
<h3>Determining Actions</h3>
<p>Next, each Stream is asked to determine what Elasticsearch actions are needed to move from the starting state to the desired state. This is the first point where the system needs to consider which Elasticsearch resources back an individual Stream.</p>
<p>If the request is a dry run, we stop here and return a summary of what would happen. If it’s meant to be executed, we move to the next phase.</p>
<h3>Planning and Execution</h3>
<p>The list of Elasticsearch actions is handed off to a dedicated class called <code>ExecutionPlan</code>. This class handles:</p>
<ul>
<li>
<p>Resolving cross-stream dependencies that individual Streams cannot address alone.</p>
</li>
<li>
<p>Organizing the actions into the correct order to ensure safe application (e.g. to avoid data loss when routing rules change).</p>
</li>
<li>
<p>Maximizing parallelism wherever possible within those ordering constraints.</p>
</li>
</ul>
<p>If the plan executes successfully, we return a success response from the API.</p>
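<p>One common way to get both a safe order and maximum parallelism is a topological sort that emits batches of independent actions. The sketch below uses Python’s standard <code>graphlib</code> module to show the idea; it is an assumption about the general technique, not a description of how <code>ExecutionPlan</code> is written, and the action names are made up for the example.</p>
<pre><code class="language-python">from graphlib import TopologicalSorter


def plan_batches(dependencies):
    '''Group actions into ordered batches.

    dependencies maps each action to the set of actions that must finish
    first. Actions inside one batch do not depend on each other, so they
    could be executed in parallel.
    '''
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical actions for creating one child stream.
actions = {
    'upsert_component_template': set(),
    'upsert_ingest_pipeline': set(),
    'upsert_index_template': {'upsert_component_template', 'upsert_ingest_pipeline'},
    'upsert_data_stream': {'upsert_index_template'},
    # Route documents to the child only once it exists, to avoid data loss.
    'update_parent_routing': {'upsert_data_stream'},
}

for batch in plan_batches(actions):
    print(batch)
</code></pre>
<p>In this example the two template-level actions land in the same batch, while the routing update is held back until the data stream it points to has been created.</p>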
<h3>Handling Failures</h3>
<p>If the plan fails during execution, the <code>State</code> class attempts a rollback: it computes a new plan that should return the system to its starting state (by going from desired state to starting state instead) and tries to execute it.</p>
<p>If the rollback also fails, we have a fallback mechanism: a “reset” operation that re-applies the known-good state stored in Elasticsearch, skipping diffing entirely.</p>
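<p>The three levels of recovery described above can be sketched as a small Python function. All parameter names here (<code>determine_actions</code>, <code>run_plan</code>, <code>reset</code>) are hypothetical callables supplied by the caller; the sketch only illustrates the order in which the fallbacks are tried.</p>
<pre><code class="language-python">def execute_with_recovery(starting, desired, determine_actions, run_plan, reset):
    '''Try the forward plan, then a rollback plan, then a full reset.'''
    try:
        run_plan(determine_actions(starting, desired))
        return 'applied'
    except Exception:
        pass  # the forward plan failed part-way; try to undo it

    try:
        # Same diffing logic, with the direction reversed.
        run_plan(determine_actions(desired, starting))
        return 'rolled_back'
    except Exception:
        # Last resort: re-apply the stored known-good state without diffing.
        reset()
        return 'reset'
</code></pre>
<p>The rollback reuses the same action-determination step as the forward path, which is only possible because both the starting and the desired state are fully modeled in memory.</p>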
<h3>A Closer Look at the Stream Active Record Classes</h3>
<p>All Streams in the State are subclasses of an abstract class called <code>StreamActiveRecord</code>. This class is responsible for:</p>
<ul>
<li>
<p>Tracking the change status of the Stream</p>
</li>
<li>
<p>Routing change application, validation, and action determination, based on that change status, to specialized template-method hooks implemented by its concrete subclasses.</p>
</li>
</ul>
<p>These hooks are as follows:</p>
<ul>
<li>
<p>Apply upsert / Apply deletion</p>
</li>
<li>
<p>Validate upsert / Validate deletion</p>
</li>
<li>
<p>Determine actions for creation / change / deletion</p>
</li>
</ul>
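<p>As a rough illustration of this template-method layout, here is a Python sketch. The real classes live in Kibana and are written in TypeScript; every method name, the change-status values, and the <code>ToyStream</code> subclass below are assumptions made for the example.</p>
<pre><code class="language-python">from abc import ABC, abstractmethod


class StreamActiveRecord(ABC):
    '''Generic lifecycle logic: tracks change status and routes to hooks.'''

    def __init__(self, name, exists=False, definition=None):
        self.name = name
        self.exists = exists  # was the stream present in the starting state?
        self.definition = definition
        self.change_status = 'unchanged'

    def apply_change(self, change):
        # Route by change type; the hooks return cascading changes.
        if change['type'] == 'delete':
            return self.do_apply_deletion(change)
        return self.do_apply_upsert(change)

    def validate(self):
        # Route by change status; the hooks return a list of error messages.
        if self.change_status == 'upserted':
            return self.do_validate_upsert()
        if self.change_status == 'deleted':
            return self.do_validate_deletion()
        return []

    def determine_actions(self):
        if self.change_status == 'unchanged':
            return []
        if self.change_status == 'deleted':
            return self.do_actions_for_deletion()
        if self.exists:
            return self.do_actions_for_change()
        return self.do_actions_for_creation()

    @abstractmethod
    def do_apply_upsert(self, change): ...

    @abstractmethod
    def do_apply_deletion(self, change): ...

    @abstractmethod
    def do_validate_upsert(self): ...

    @abstractmethod
    def do_validate_deletion(self): ...

    @abstractmethod
    def do_actions_for_creation(self): ...

    @abstractmethod
    def do_actions_for_change(self): ...

    @abstractmethod
    def do_actions_for_deletion(self): ...


class ToyStream(StreamActiveRecord):
    '''Stream-specific behavior only; no orchestration code needed.'''

    def do_apply_upsert(self, change):
        if change['name'] == self.name:
            self.definition = change['definition']
            self.change_status = 'upserted'
        return []

    def do_apply_deletion(self, change):
        if change['name'] == self.name:
            self.change_status = 'deleted'
        return []

    def do_validate_upsert(self):
        return [] if self.definition else ['definition must not be empty']

    def do_validate_deletion(self):
        return []

    def do_actions_for_creation(self):
        return ['create data stream ' + self.name]

    def do_actions_for_change(self):
        return ['update data stream ' + self.name]

    def do_actions_for_deletion(self):
        return ['delete data stream ' + self.name]
</code></pre>
<p>The base class never needs to know what an upsert means for a particular Stream type, and the subclass never needs to know when its hooks are called.</p>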
<p>With this architecture in place, we’ve created a clear, phased, and declarative flow from input to action—one that’s modular, testable, and resilient to failure. It cleanly separates generic stream lifecycle logic (like change tracking and orchestration) from stream-specific behaviors (such as what “upsert” means for a given Stream type), enabling a highly extensible system. This structure allows us to isolate side effects, validate with confidence, and reason more clearly about system-wide behavior—all while supporting dry runs and bulk operations.</p>
<p>Now that we’ve covered how it works, let’s explore what this unlocks—the capabilities, safety guarantees, and new workflows this design makes possible.</p>
<h2>What This Unlocks</h2>
<p>The reconciliation-based design we landed on isn’t just easier to reason about; it directly addresses many of the core limitations we faced in the earlier version of the system.</p>
<p><strong>Bulk operations and dry runs, by design</strong></p>
<p>One of our key goals was to support bulk configuration changes across many Streams in a single request. The previous codebase made this difficult because the side effects were interleaved with decision-making logic, making it risky to apply multiple changes at once.</p>
<p>Now, bulk changes are the default. The <code>State</code> class handles any number of changes, tracks cascading effects automatically, and validates the end result as a whole. Whether you're updating one Stream or fifty, the pipeline handles it consistently.</p>
<p>Dry runs were another desired feature. Because actions are now computed in a side-effect-free step—before anything is sent to Elasticsearch—we can generate a full preview of what would happen. This includes both which Streams would change and what specific Elasticsearch operations would be performed. That visibility helps users and developers make confident, informed decisions.</p>
<p><strong>Easier debugging, better diagnostics</strong></p>
<p>In the old system, debugging required reconstructing the execution context and piecing together side effects. Now, every phase of the pipeline is explicit and testable in isolation, so you can track down a problem by following the phases one at a time.</p>
<p>Because validation and Elasticsearch actions are now tied directly to the Stream definition and lifecycle, any inconsistencies or errors are easier to trace to their source.</p>
<p><strong>Validated planning before execution</strong></p>
<p>Because we now validate and plan <em>before</em> making any changes, the risk of leaving the system in an inconsistent or partially-updated state has been greatly reduced. All actions are determined in advance, and only executed once we’re confident the entire set of changes is valid and coherent.</p>
<p>And if something does go wrong during execution, we can lean on the fact that both the starting and desired states are fully modeled in memory. This allows us to generate a rollback plan automatically, and when that’s not possible, fall back to a complete reset from the stored state. In short: safety is now built in, not bolted on.</p>
<p><strong>Extensible by default</strong></p>
<p>Adding a new type of Stream used to mean editing logic scattered across multiple files.
Now, it’s a focused, well-defined task. You subclass <code>StreamActiveRecord</code> and implement the handful of lifecycle hooks.</p>
<p>That’s it. The orchestration, tracking, and dependency handling are already wired up. That also means it’s easier to onboard new developers or experiment with new Stream types without fear of breaking unrelated parts of the system.</p>
<p><strong>Easier to test</strong></p>
<p>Because each Stream is now encapsulated and has clear, isolated responsibilities, testing is much simpler. You can test individual Stream classes by simulating specific inputs and asserting the resulting cascading changes, validation results, or Elasticsearch actions. There's no need to spin up a full end-to-end environment just to test a single validation.</p>
<h2>What’s Next</h2>
<p>At Elastic, we live by our Source Code, which states “Progress, SIMPLE Perfection”—a reminder to favor steady, incremental improvement over chasing perfection.</p>
<p>This new system is a solid foundation—but it’s only the beginning. Our focus so far has been on clarity, safety, and extensibility, and while we’ve addressed some long-standing pain points, there’s still plenty of room to evolve.</p>
<h3>Continuous improvement ahead</h3>
<p>We intentionally shipped this work with a sharp scope and have already identified several enhancements that we will be adding in the coming weeks:</p>
<ul>
<li>
<p><strong>Introduce a locking layer</strong><br />
To safely handle concurrent updates, we plan to introduce a locking mechanism that prevents race conditions during parallel modifications.</p>
</li>
<li>
<p><strong>Expose bulk and dry-run features via our APIs</strong><br />
The <code>State</code> class already supports them—now it’s time to make those capabilities available to users.</p>
</li>
<li>
<p><strong>Improve debugging output</strong><br />
Now that state transitions are modeled explicitly, we can expose clearer diagnostics to help both users and developers reason about changes.</p>
</li>
<li>
<p><strong>Avoid redundant Elasticsearch requests</strong><br />
Currently we make multiple redundant requests during validation. Introducing a lightweight in-memory cache would let us avoid reloading the same resource more than once.</p>
</li>
<li>
<p><strong>Improve access controls</strong><br />
Currently, we rely on Elasticsearch to enforce access control. Because a single change can touch many different resources, it’s difficult to determine up front which privileges are required. We plan to extend our action definitions with privilege metadata, enabling us to validate the full set of required permissions before executing any actions. This will let us detect and report missing privileges early—before the plan runs.</p>
</li>
<li>
<p><strong>Add APM instrumentation</strong><br />
With the system structured in distinct, well-defined phases, we’re now in a great position to add performance instrumentation. This will help us identify bottlenecks and improve responsiveness over time.</p>
</li>
</ul>
<h3>Revisiting responsibilities</h3>
<p>As our orchestration becomes more robust, we’re also re-evaluating where it should live. Large-scale bulk operations, for example, might eventually be better handled closer to Elasticsearch itself, where we can benefit from greater atomicity and tighter performance guarantees. That kind of deep integration would have been premature earlier on—when we were still figuring out the right abstractions and phases for the system. But now that the design has stabilized, we’re in a much better position to start that conversation.</p>
<h3>Built to evolve</h3>
<p>We designed this system with adaptability in mind. Whether improvements come in the form of internal refactors, better developer experience, or deeper collaboration with Elasticsearch, we’re in a strong position to keep evolving. The architecture is modular by design—and that gives us both the stability to rely on and the flexibility to grow.</p>
<h2>Wrapping Up</h2>
<p>Building robust, maintainable systems is never just about code; it’s about aligning architecture with the evolving needs and direction of the product. Our journey refactoring Streams reaffirmed that a thoughtful, phased approach not only improves technical clarity but also empowers teams to move faster and innovate more confidently.</p>
<p>If you’re working on complex systems facing similar challenges—whether tangled logic, unpredictable side effects, or the need for extensibility—you’re not alone. We hope our story offers some useful insights and inspiration as you shape your own path forward.</p>
<p>We welcome feedback and collaboration from the community—whether it’s in the form of questions, ideas, or code.</p>
<p>To learn more about Streams, explore:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
<p><em>Check out the</em> <a href="https://github.com/elastic/kibana/pull/211696"><em>pull request on GitHub</em></a> to dive into the code or join the conversation.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/from-tangled-to-streamlined-how-we-made-streams-robust-by-using-reconciliation/article.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Future-proof your logs with ecs@mappings template]]></title>
            <link>https://www.elastic.co/observability-labs/blog/future-proof-your-logs-with-ecs-mappings-template</link>
            <guid isPermaLink="false">future-proof-your-logs-with-ecs-mappings-template</guid>
            <pubDate>Mon, 23 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Explore how the ecs@mappings component template in Elasticsearch simplifies data management by providing a centralized, official definition of Elastic Common Schema (ECS) mappings. Learn about its benefits, including reduced configuration hassles, improved data integrity, and enhanced performance for both integration developers and community users. Discover how this feature streamlines ECS field support across Elastic Agent integrations and future-proofs your data streams.]]></description>
            <content:encoded><![CDATA[<p>As the Elasticsearch ecosystem evolves, so do the tools and methodologies designed to streamline data management. One advancement that will significantly benefit our community is the <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">ecs@mappings</a> component template.</p>
<p><a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">ECS (Elastic Common Schema)</a> is a standardized data model for logs and metrics. It defines a set of <a href="https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html">common field names and data types</a> that help ensure consistency and compatibility.</p>
<p><code>ecs@mappings</code> is a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-component-template.html">component template</a> that offers an <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">Elastic-maintained</a> definition of ECS mappings. Each Elasticsearch release contains an always up-to-date definition of all ECS fields.</p>
<h3>Elastic Common Schema and OpenTelemetry</h3>
<p>Elastic will preserve our users' investment in Elastic Common Schema by <a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-faq">donating</a> ECS to OpenTelemetry. Elastic participates and collaborates with the OTel community to merge ECS and OpenTelemetry's Semantic Conventions over time.</p>
<h2>The Evolution of ECS Mappings</h2>
<p>Historically, users and integration developers have defined ECS (Elastic Common Schema) mappings manually within individual index templates and packages, each meticulously listing its fields. Although straightforward, this approach proved time-consuming and challenging to maintain.</p>
<p>To tackle this challenge, integration developers moved towards two primary methodologies:</p>
<ol>
<li>Referencing ECS mappings</li>
<li>Importing ECS mappings directly</li>
</ol>
<p>These methods were steps in the right direction but introduced their own challenges, such as the maintenance cost of keeping the ECS mappings up-to-date with Elasticsearch changes.</p>
<h2>Enter ecs@mappings</h2>
<p>The <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">ecs@mappings</a> component template supports all the field definitions in ECS, leveraging naming conventions and a set of dynamic templates.</p>
<p>Elastic started shipping the <code>ecs@mappings</code> component template with Elasticsearch v8.9.0, including it in the <a href="https://github.com/elastic/elasticsearch/blob/v8.14.2/x-pack/plugin/core/template-resources/src/main/resources/logs%40template.json"><code>logs-*-*</code> index template</a>.</p>
<p>With Elasticsearch v8.13.0, Elastic now includes <code>ecs@mappings</code> in the index templates of all the Elastic Agent integrations.</p>
<p>This move was a breakthrough because:</p>
<ul>
<li><strong>Centralized and official</strong>: With <code>ecs@mappings</code>, we now have an official definition of ECS mappings.</li>
<li><strong>Out-of-the-box functionality</strong>: ECS mappings are readily available, reducing the need for additional imports or references.</li>
<li><strong>Simplified maintenance</strong>: The need to manually keep up with ECS changes has diminished since the template from Elasticsearch itself remains up-to-date.</li>
</ul>
<h3>Enhanced Consistency and Reliability</h3>
<p>With <code>ecs@mappings</code>, ECS mappings become the single source of truth. This unified approach means fewer discrepancies and higher consistency in data streams across integrations.</p>
<h2>How Community Users Benefit</h2>
<p>Community users stand to gain in several ways from the adoption of <code>ecs@mappings</code>. Here are the key advantages:</p>
<ol>
<li><strong>Reduced configuration hassles</strong>: Whether you are an advanced user or just getting started, the simplified setup means fewer configuration steps and fewer opportunities for errors.</li>
<li><strong>Improved data integrity</strong>: Since ecs@mappings ensures that field definitions are accurate and up-to-date, data integrity is maintained effortlessly.</li>
<li><strong>Better performance</strong>: With less overhead in maintaining and referencing ECS fields, your Elasticsearch operations run more smoothly.</li>
<li><strong>Enhanced documentation and discoverability</strong>: As we standardize ECS mappings, the documentation can be centralized, making it easier for users to discover and understand ECS fields.</li>
</ol>
<p>Let's explore how the <code>ecs@mappings</code> component template helps users achieve these benefits.</p>
<h3>Reduced configuration hassles</h3>
<p>Modern Elasticsearch versions come with out-of-the-box full ECS field support (see the “requirements” section later for specific versions).</p>
<p>For example, the <a href="https://docs.elastic.co/integrations/aws_logs">Custom AWS Logs integration</a> installed on a supported Elasticsearch cluster already includes the <code>ecs@mappings</code> component template in its index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      ...,
        &quot;composed_of&quot;: [
          &quot;logs@settings&quot;,
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;ecs@mappings&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
    ...
</code></pre>
<p>There is no need to import or define any ECS field.</p>
<h3>Improved data integrity</h3>
<p>The <code>ecs@mappings</code> component template supports all the existing ECS fields. If you use any ECS field in your document, Elasticsearch maps it with the expected type.</p>
<p>To keep <code>ecs@mappings</code> in sync with the <a href="https://github.com/elastic/ecs/">ECS repository</a>, we set up a daily <a href="https://github.com/elastic/elasticsearch/blob/6ae9dbfda7d71ae3f1bd2bddf9334d37b3294632/x-pack/plugin/stack/src/javaRestTest/java/org/elasticsearch/xpack/stack/EcsDynamicTemplatesIT.java#L49">automated test</a> that verifies the component template supports all fields.</p>
<h3>Better Performance</h3>
<h4>Compact definitions</h4>
<p>The ECS field definition is exceptionally compact; at the time of this writing, it is 228 lines long and supports all ECS fields. To learn more, see the <code>ecs@mappings</code> component template <a href="https://github.com/elastic/elasticsearch/blob/v8.15.1/x-pack/plugin/core/template-resources/src/main/resources/ecs%40mappings.json">source code</a>.</p>
<p>It relies on naming conventions and uses <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.14/dynamic-templates.html">dynamic templates</a> to achieve this compactness.</p>
<h4>Lazy mapping</h4>
<p>Thanks to dynamic templates, Elasticsearch adds a field to the mapping only when that field actually appears in an indexed document. This lazy mapping keeps memory overhead to a minimum, improving cluster performance and making field suggestions more relevant.</p>
<h3>Enhanced documentation and discoverability</h3>
<p>All Elastic Agent integrations are migrating to the <code>ecs@mappings</code> component template. These integrations no longer need to add and maintain ECS field mappings and can reference the official <a href="https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html">ECS Field Reference</a> or the ECS source code in the Git repository: <a href="https://github.com/elastic/ecs/">https://github.com/elastic/ecs/</a>.</p>
<h2>Getting started</h2>
<h3>Requirements</h3>
<p>To leverage the <code>ecs@mappings</code> component template, make sure you are running at least the following stack version:</p>
<ul>
<li><strong>8.9.0</strong>: if your data stream uses the logs index template or you define your own index template.</li>
<li><strong>8.13.0</strong>: if your data stream uses the index template of an Elastic Agent integration.</li>
</ul>
<h3>Example</h3>
<p>We will use the <a href="https://docs.elastic.co/integrations/aws_logs">Custom AWS Logs integration</a> to show you how <code>ecs@mappings</code> can handle mapping for any out-of-the-box ECS field.</p>
<p>Imagine you want to ingest the following log event using the Custom AWS Logs integration:</p>
<pre><code class="language-json">{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<h4>Dev Tools</h4>
<p>Kibana offers an excellent tool for experimenting with the Elasticsearch APIs, the <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Dev Tools console</a>. With the Dev Tools, users can run all API requests quickly and without much friction.</p>
<p>To open the Dev Tools:</p>
<ul>
<li>Open <strong>Kibana</strong></li>
<li>Select <strong>Management &gt; Dev Tools &gt; Console</strong></li>
</ul>
<h4>Elasticsearch version &lt; 8.13</h4>
<p>On Elasticsearch versions before 8.13, the Custom AWS Logs integration has the following index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      &quot;index_template&quot;: {
        &quot;index_patterns&quot;: [
          &quot;logs-aws_logs.generic-*&quot;
        ],
        &quot;template&quot;: {
          &quot;settings&quot;: {},
          &quot;mappings&quot;: {
            &quot;_meta&quot;: {
              &quot;package&quot;: {
                &quot;name&quot;: &quot;aws_logs&quot;
              },
              &quot;managed_by&quot;: &quot;fleet&quot;,
              &quot;managed&quot;: true
            }
          }
        },
        &quot;composed_of&quot;: [
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
        &quot;priority&quot;: 200,
        &quot;_meta&quot;: {
          &quot;package&quot;: {
            &quot;name&quot;: &quot;aws_logs&quot;
          },
          &quot;managed_by&quot;: &quot;fleet&quot;,
          &quot;managed&quot;: true
        },
        &quot;data_stream&quot;: {
          &quot;hidden&quot;: false,
          &quot;allow_custom_routing&quot;: false
        }
      }
    }
  ]
}
</code></pre>
<p>As you can see, it does not include the ecs@mappings component template.</p>
<p>If we try to index the test document:</p>
<pre><code class="language-json">POST logs-aws_logs.generic-default/_doc
{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<p>The data stream will have the following mappings:</p>
<pre><code class="language-json">GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;command_line&quot;: {
        &quot;full_name&quot;: &quot;command_line&quot;,
        &quot;mapping&quot;: {
          &quot;command_line&quot;: {
            &quot;type&quot;: &quot;keyword&quot;,
            &quot;ignore_above&quot;: 1024
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;custom_score&quot;: {
        &quot;full_name&quot;: &quot;custom_score&quot;,
        &quot;mapping&quot;: {
          &quot;custom_score&quot;: {
            &quot;type&quot;: &quot;long&quot;
          }
        }
      }
    }
  }
}
</code></pre>
<p>These mappings do not align with ECS, so users and developers had to maintain them.</p>
<h4>Elasticsearch version &gt;= 8.13</h4>
<p>On Elasticsearch versions 8.13 or newer, the Custom AWS Logs integration has the following index template:</p>
<pre><code class="language-json">GET _index_template/logs-aws_logs.generic
{
  &quot;index_templates&quot;: [
    {
      &quot;name&quot;: &quot;logs-aws_logs.generic&quot;,
      &quot;index_template&quot;: {
        &quot;index_patterns&quot;: [
          &quot;logs-aws_logs.generic-*&quot;
        ],
        &quot;template&quot;: {
          &quot;settings&quot;: {},
          &quot;mappings&quot;: {
            &quot;_meta&quot;: {
              &quot;package&quot;: {
                &quot;name&quot;: &quot;aws_logs&quot;
              },
              &quot;managed_by&quot;: &quot;fleet&quot;,
              &quot;managed&quot;: true
            }
          }
        },
        &quot;composed_of&quot;: [
          &quot;logs@settings&quot;,
          &quot;logs-aws_logs.generic@package&quot;,
          &quot;logs-aws_logs.generic@custom&quot;,
          &quot;ecs@mappings&quot;,
          &quot;.fleet_globals-1&quot;,
          &quot;.fleet_agent_id_verification-1&quot;
        ],
        &quot;priority&quot;: 200,
        &quot;_meta&quot;: {
          &quot;package&quot;: {
            &quot;name&quot;: &quot;aws_logs&quot;
          },
          &quot;managed_by&quot;: &quot;fleet&quot;,
          &quot;managed&quot;: true
        },
        &quot;data_stream&quot;: {
          &quot;hidden&quot;: false,
          &quot;allow_custom_routing&quot;: false
        },
        &quot;ignore_missing_component_templates&quot;: [
          &quot;logs-aws_logs.generic@custom&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>The index template for <code>logs-aws_logs.generic</code> now includes the <code>ecs@mappings</code> component template.</p>
<p>If we try to index the test document:</p>
<pre><code class="language-json">POST logs-aws_logs.generic-default/_doc
{
  &quot;@timestamp&quot;: &quot;2024-06-11T13:16:00+02:00&quot;, 
  &quot;command_line&quot;: &quot;ls -ltr&quot;,
  &quot;custom_score&quot;: 42
}
</code></pre>
<p>The data stream will have the following mappings:</p>
<pre><code class="language-json">GET logs-aws_logs.generic-default/_mapping/field/command_line
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;command_line&quot;: {
        &quot;full_name&quot;: &quot;command_line&quot;,
        &quot;mapping&quot;: {
          &quot;command_line&quot;: {
            &quot;type&quot;: &quot;wildcard&quot;,
            &quot;fields&quot;: {
              &quot;text&quot;: {
                &quot;type&quot;: &quot;match_only_text&quot;
              }
            }
          }
        }
      }
    }
  }
}

GET logs-aws_logs.generic-default/_mapping/field/custom_score
{
  &quot;.ds-logs-aws_logs.generic-default-2024.06.11-000001&quot;: {
    &quot;mappings&quot;: {
      &quot;custom_score&quot;: {
        &quot;full_name&quot;: &quot;custom_score&quot;,
        &quot;mapping&quot;: {
          &quot;custom_score&quot;: {
            &quot;type&quot;: &quot;float&quot;
          }
        }
      }
    }
  }
}
</code></pre>
<p>In Elasticsearch 8.13 and later, fields like <code>command_line</code> and <code>custom_score</code> get their definition from ECS out-of-the-box.</p>
<p>These mappings align with ECS, so users and developers do not have to maintain them. The same applies to all the hundreds of field definitions in the Elastic Common Schema, and all it takes is including a single component template of roughly 200 lines in your data stream.</p>
<h2>Caveats</h2>
<p>Some aspects of how the ecs@mappings component template deals with data types are worth mentioning.</p>
<h3>ECS types are not enforced</h3>
<p>The <code>ecs@mappings</code> component template does not contain mappings for ECS fields where dynamic mapping already uses the correct field type. Therefore, if you send a field value with a compatible but wrong type, Elasticsearch will not coerce the value.</p>
<p>For example, if you send the following document with a <code>faas.coldstart</code> field (defined as <code>boolean</code> in ECS):</p>
<pre><code class="language-json">{
  &quot;faas.coldstart&quot;: &quot;true&quot;
}
</code></pre>
<p>Elasticsearch will map <code>faas.coldstart</code> as a <code>keyword</code> and not a <code>boolean</code>. Therefore, you need to make sure that the values you ingest to Elasticsearch use the right JSON field types, according to how they’re defined in ECS.</p>
<p>This is the tradeoff for having a compact and efficient ecs@mappings component template. It also allows for better compatibility when dealing with a mix of ECS and custom fields because documents won’t be rejected if the types are not consistent with the ones defined in ECS.</p>
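<p>If your producers cannot be changed to emit proper JSON types, one option is to normalize values before they reach Elasticsearch. The Python sketch below shows the idea for boolean fields. It is only an illustration: the <code>BOOLEAN_FIELDS</code> set and the <code>normalize_booleans</code> helper are hypothetical, and you would need to populate the set yourself from the ECS Field Reference.</p>
<pre><code class="language-python"># Illustrative subset of ECS fields defined as boolean; extend as needed.
BOOLEAN_FIELDS = {'faas.coldstart'}


def normalize_booleans(doc):
    '''Return a copy of doc with string booleans turned into real booleans.'''
    fixed = dict(doc)
    for field in BOOLEAN_FIELDS:
        value = fixed.get(field)
        # Only touch strings that unambiguously spell a boolean.
        if isinstance(value, str) and value.lower() in ('true', 'false'):
            fixed[field] = value.lower() == 'true'
    return fixed


event = normalize_booleans({'faas.coldstart': 'true'})
print(event)
</code></pre>
<p>Serialized as JSON, the normalized event carries an unquoted <code>true</code>, so dynamic mapping sees a boolean instead of a string. Fields outside the set, including your custom fields, pass through unchanged.</p>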
<h2>Conclusion</h2>
<p>The introduction of <code>ecs@mappings</code> marks a significant improvement in managing ECS mappings within Elasticsearch. By centralizing and streamlining these definitions, we can ensure higher consistency, reduced maintenance, and better overall performance.</p>
<p>Whether you're an integration developer or a community user, moving to <code>ecs@mappings</code> represents a step towards more efficient and reliable Elasticsearch operations. As we continue incorporating feedback and evolving our tools, your journey with Elasticsearch will only get smoother and more rewarding.</p>
<p><strong>Join the Conversation</strong></p>
<p>Do you have questions or feedback about <code>ecs@mappings</code>? Join our community of users on the <a href="https://discuss.elastic.co/">discussion forum</a> and <a href="https://ela.st/slack">Slack instance</a> to ask questions and share your experiences. Your input is invaluable in helping us fine-tune these advancements for the entire community.</p>
<p>Happy mapping!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/future-proof-your-logs-with-ecs-mappings-template/article.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting more from your logs with OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/getting-more-from-your-logs-with-opentelemetry</link>
            <guid isPermaLink="false">getting-more-from-your-logs-with-opentelemetry</guid>
            <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to evolve beyond basic log ingest by leveraging OpenTelemetry for ingestion, structured logging, geographic enrichment, and ES|QL analytics. Transform raw log data into actionable intelligence with practical examples and proactive observability strategies.]]></description>
            <content:encoded><![CDATA[<h1>Getting more from your logs with OpenTelemetry</h1>
<p>Most people still use their logging tools the way we have for decades: as a simple search lake, essentially grepping for logs, only from a centralized platform. There’s nothing wrong with this, and a centralized logging platform delivers a lot of value. But how can you start to evolve beyond this basic log-and-search use case and become more effective in your incident investigations? In this blog, we start from where most of our customers are today and give you some practical tips on how to move beyond this simple logging use case.</p>
<h2>Ingestion</h2>
<p>Let's start at the beginning: ingest. Many of you are probably using older tools for ingestion today. If you want to be more forward-thinking here, it’s time to introduce you to OpenTelemetry. OpenTelemetry was once not very mature or capable for logging, but things have changed significantly. Elastic has been working particularly hard to improve the logging capabilities of OpenTelemetry. So let's explore how to bring logs into Elastic via the OpenTelemetry Collector.</p>
<p>First, if you want to follow along, create a host to run the log generator and the OpenTelemetry Collector.</p>
<p>Follow the instructions here to get the log generator running:</p>
<p><a href="https://github.com/davidgeorgehope/log-generator-bin/">https://github.com/davidgeorgehope/log-generator-bin/</a></p>
<p>To get the OpenTelemetry Collector up and running in <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Elastic Serverless</a>, click Add Data in the bottom left, then 'host', and finally 'opentelemetry'.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image14.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image7.png" alt="" /></p>
<p>Follow the instructions but don’t start the collector just yet.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image16.png" alt="" /></p>
<p>Our host here is running a three-tier application with an NGINX frontend, a backend, and a MySQL database. So let's start by bringing the logs into Elastic.</p>
<p>First we’ll install the Elastic Distributions for OpenTelemetry, but before starting the collector we will make a small change to the OpenTelemetry configuration file to expand the directories it searches for logs. Edit otel.yml using vi or your favorite editor:</p>
<pre><code class="language-bash">vi otel.yml
</code></pre>
<p>Instead of just /var/log/*.log, we will use /var/log/**/*.log to bring in all our log files, including those in subdirectories.</p>
<pre><code class="language-yaml">receivers:
  # Receiver for platform specific log files
  filelog/platformlogs:
    include: [ /var/log/**/*.log ]
    retry_on_failure:
      enabled: true
    start_at: end
    storage: file_storage
</code></pre>
<p>Start the OTel collector:</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>We can now see the logs arriving in Discover:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image8.png" alt="" /></p>
<p>One thing that is immediately noticeable is that, without changing anything, we automatically get a bunch of useful additional information, such as the OS name and CPU information.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image12.png" alt="" /></p>
<p>The OpenTelemetry collector has automatically started to enrich our logs, making them more useful for additional processing, though we could do significantly better!</p>
<p>To start with, we want to give our logs some structure. Let's edit that otel.yml file and add some OTTL (OpenTelemetry Transformation Language) statements to extract some key data from our NGINX logs.</p>
<pre><code class="language-yaml">  transform/parse_nginx:
    trace_statements: []
    metric_statements: []
    log_statements:
      - context: log
        conditions:
          - 'attributes[&quot;log.file.name&quot;] != nil and IsMatch(attributes[&quot;log.file.name&quot;], &quot;access.log&quot;)'
        statements:
          - merge_maps(attributes, ExtractPatterns(body, &quot;^(?P&lt;client_ip&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;^\\S+ - (?P&lt;user&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\\[(?P&lt;timestamp_raw&gt;[^\\]]+)\\]&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;(?P&lt;method&gt;\\S+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;\\S+ (?P&lt;path&gt;\\S+)\\?&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;req_id=(?P&lt;req_id&gt;[^ ]+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; (?P&lt;status&gt;\\d+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; \\d+ (?P&lt;size&gt;\\d+)&quot;), &quot;upsert&quot;)
# ... (rest of the configuration elided)

service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
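<p>To see what these statements extract, here is a minimal Python sketch that applies the same regular expressions to a single access-log line. The sample line and the parse_access_line helper are assumptions made for illustration (your NGINX log format may differ), and unnamed capture groups stand in for the named groups used by ExtractPatterns.</p>
<pre><code class="language-python">import re

# Same expressions as the OTTL statements above, one per attribute.
# Unnamed groups are used here; the dict key plays the role of the group name.
PATTERNS = {
    'client_ip': r'^(\S+)',
    'user': r'^\S+ - (\S+)',
    'timestamp_raw': r'\[([^\]]+)\]',
    'method': r'"(\S+) ',
    'path': r'"\S+ (\S+)\?',
    'req_id': r'req_id=([^ ]+)',
    'status': r'" (\d+) ',
    'size': r'" \d+ (\d+)',
}


def parse_access_line(body):
    """Return the attributes extracted from one log line (hypothetical helper)."""
    attributes = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, body)
        if match:  # a pattern that does not match simply adds nothing
            attributes[name] = match.group(1)
    return attributes


# Hypothetical sample line; the real generator output may look different.
sample = '203.0.113.7 - alice [27/Jun/2025:10:00:00 +0000] "GET /checkout?req_id=abc123 HTTP/1.1" 500 1234'
attributes = parse_access_line(sample)
for name, value in attributes.items():
    print(name, '=', value)
</code></pre>
<p>Each key in the resulting dictionary corresponds to one attribute that the transform processor would add to the log record, which makes this a handy way to test a pattern against your own log lines before editing the collector configuration.</p>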
<p>Now when we start the OTel collector with this new configuration:</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>We will see that we now have structured logs!!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image17.png" alt="" /></p>
<h2>Store and Optimize</h2>
<p>To ensure you aren’t blowing out your budget with all this additional structured data, there are a few things you can do to maximize storage efficiency.</p>
<p>For example, you can use filter processors in the OTel collector to drop log records you don't need, which controls the volume leaving the collector.</p>
<pre><code class="language-yaml">processors:
  filter/drop_logs_without_user_attributes:
    logs:
      log_record:
        - 'attributes[&quot;user&quot;] == nil'
  filter/drop_200_logs:
    logs:
      log_record:
        - 'attributes[&quot;status&quot;] == &quot;200&quot;'

service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, filter/drop_logs_without_user_attributes, filter/drop_200_logs, resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
<p>The filter processor helps reduce noise, for example by dropping debug logs or logs from a noisy service. It's a great way to keep a lid on your observability spend.</p>
<p>Additionally, for your most critical flows and logs where you don’t want to drop any data, Elastic has you covered. In version 9.x of Elastic, LogsDB is switched on by default.</p>
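<p>One detail worth keeping in mind is that filter conditions describe what to drop, not what to keep. The Python sketch below mimics that behavior for the two conditions above. The sample records and the should_drop helper are hypothetical and only illustrate the logic, not the collector's implementation.</p>
<pre><code class="language-python">def should_drop(attributes):
    """Mirror the two filter conditions: a record is dropped if either is true."""
    missing_user = attributes.get('user') is None
    is_ok_response = attributes.get('status') == '200'
    return missing_user or is_ok_response


# Hypothetical parsed records for illustration.
records = [
    {'user': 'alice', 'status': '500'},
    {'user': 'bob', 'status': '200'},
    {'status': '404'},
]

kept = [r for r in records if not should_drop(r)]
print(kept)
</code></pre>
<p>Only the first record survives: the second is a 200 response and the third has no user attribute. Before enabling a filter like this, check that you are comfortable losing every record that matches.</p>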
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image15.png" alt="" /></p>
<p>With LogsDB, Elastic has reduced the storage footprint of log data in Elasticsearch by up to 65%, allowing you to store more observability and security data without exceeding your budget, while keeping all data accessible and searchable.</p>
<p>LogsDB achieves this through advanced compression techniques such as ZSTD, delta encoding, and run-length encoding. It also reconstructs the _source field on demand (synthetic _source), saving about 40% more storage by not retaining the original JSON document. Synthetic _source represents the introduction of columnar storage within Elasticsearch.</p>
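<p>To build some intuition for why sorted, structured log data compresses well, the following sketch applies delta encoding and then run-length encoding to a list of timestamps. It is a toy illustration in Python with made-up values, not a description of how Elasticsearch implements LogsDB internally.</p>
<pre><code class="language-python">def delta_encode(values):
    """Keep the first value, then store only the difference to the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return deltas


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


# Made-up epoch-second timestamps, one log line per second.
timestamps = [1700000000, 1700000001, 1700000002, 1700000003, 1700000004]
deltas = delta_encode(timestamps)
encoded = run_length_encode(deltas)
print(deltas)
print(encoded)
</code></pre>
<p>Five large numbers collapse into one starting value plus a single run of identical small deltas, which is the kind of regularity that timestamped log data tends to have.</p>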
<h2>Analytics</h2>
<p>So we have our data in Elastic. It’s structured, and it conforms to the idea of a wide-event log, since it has lots of good context (user IDs, request IDs) and the data is captured at the start of a request. Next we’re going to look at the analytics part of this. First, let's take a stab at counting the number of errors for each user in our application.</p>
<pre><code class="language-esql">FROM logs-generic.otel-default
| WHERE log.file.name == &quot;access.log&quot;
| WHERE attributes.status &gt;= &quot;400&quot;
| STATS error_count = COUNT(*) BY attributes.user
| SORT error_count DESC
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image9.png" alt="" /></p>
<p>It’s now pretty easy to save this and put it on a dashboard; we just click the save button:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image1.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image5.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image6.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image3.png" alt="" /></p>
<p>Next, let's put something together to show the global impact. First we will update our collector config to enrich our log data with geolocation.</p>
<p>Update the OTTL configuration with one new line, the final set statement below:</p>
<pre><code class="language-yaml">   log_statements:
      - context: log
        conditions:
          - 'attributes[&quot;log.file.name&quot;] != nil and IsMatch(attributes[&quot;log.file.name&quot;], &quot;access.log&quot;)'
        statements:
          - merge_maps(attributes, ExtractPatterns(body, &quot;^(?P&lt;client_ip&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;^\\S+ - (?P&lt;user&gt;\\S+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\\[(?P&lt;timestamp_raw&gt;[^\\]]+)\\]&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;(?P&lt;method&gt;\\S+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot;\\S+ (?P&lt;path&gt;\\S+)\\?&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;req_id=(?P&lt;req_id&gt;[^ ]+)&quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; (?P&lt;status&gt;\\d+) &quot;), &quot;upsert&quot;)
          - merge_maps(attributes, ExtractPatterns(body, &quot;\&quot; \\d+ (?P&lt;size&gt;\\d+)&quot;), &quot;upsert&quot;)
          - set(attributes[&quot;source.address&quot;], attributes[&quot;client_ip&quot;]) where attributes[&quot;client_ip&quot;] != nil
</code></pre>
<p>Next, add a new processor (you will need to download the GeoIP database from MaxMind):</p>
<pre><code class="language-yaml">geoip:
  context: record
  source:
    from: attributes
  providers:
    maxmind:
      database_path: /opt/geoip/GeoLite2-City.mmdb
</code></pre>
<p>And add it to the logs pipeline after transform/parse_nginx:</p>
<pre><code class="language-yaml">service:
  pipelines:
    logs/platformlogs:
      receivers: [filelog/platformlogs]
      processors: [transform/parse_nginx, geoip, resourcedetection]
      exporters: [elasticsearch/otel]
</code></pre>
<p>Start the OTel collector:</p>
<pre><code class="language-bash">sudo ./otelcol --config otel.yml
</code></pre>
<p>Once the data starts flowing we can add a map visualization:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image2.png" alt="" /></p>
<p>Add a layer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image4.png" alt="" /></p>
<p>Use ES|QL</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image10.png" alt="" /></p>
<p>Use the following ES|QL</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image13.png" alt="" /></p>
<p>And this should give you a map showing the locations of all your NGINX server requests!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/image11.png" alt="" /></p>
<p>As you can see, analytics is a breeze with your new Otel data collection pipeline.</p>
<h2>Conclusion: Beyond log aggregation to operational intelligence</h2>
<p>The journey from basic log aggregation to structured, enriched observability represents more than a technical upgrade; it's a shift in how organizations approach system understanding and incident response. By adopting OpenTelemetry for ingestion, implementing intelligent filtering to manage costs, and leveraging LogsDB's storage optimizations, you're not just modernizing your ELK stack; you're building the foundation for proactive system management.</p>
<p>The structured logs, geographic enrichment, and analytical capabilities demonstrated here transform raw log data into actionable intelligence with ES|QL. Instead of reactive grepping through logs during incidents, you now have the infrastructure to identify patterns, track user journeys, and correlate issues across your entire stack before they become critical problems.</p>
<p>But here's the key question: Are you prepared to act on these insights? Having rich, structured data is only valuable if your organization can shift from a reactive &quot;find and fix&quot; mentality to a proactive &quot;predict and prevent&quot; approach. The real evolution isn't in your logging stack, it's in your operational culture.</p>
<p>Get started with this today in <a href="https://cloud.elastic.co/serverless-registration?onboarding_token=observability">Elastic Serverless</a></p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/getting-more-from-your-logs-with-opentelemetry/getting-more-from-your-logs-with-opentelemetry.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[How to remove PII from your Elastic data in 3 easy steps]]></title>
            <link>https://www.elastic.co/observability-labs/blog/remove-pii-data</link>
            <guid isPermaLink="false">remove-pii-data</guid>
            <pubDate>Tue, 20 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Personally Identifiable Information compliance is an ever increasing challenge for any organization. With Elastic's intuitive ML interface and parsing capabilities, sensitive data can be easily redacted from unstructured data.]]></description>
            <content:encoded><![CDATA[<p>Personally identifiable information (PII) compliance is an ever-increasing challenge for any organization. Whether you’re in ecommerce, banking, healthcare, or other fields where data is sensitive, PII may inadvertently be captured and stored. Structured logs make it easy to quickly identify, remove, and protect sensitive data fields, but what about unstructured messages, or perhaps call center transcriptions?</p>
<p>Elasticsearch, with its long experience in <a href="https://www.elastic.co/what-is/elasticsearch-machine-learning">machine learning</a>, provides various options to bring in custom models, such as large language models (LLMs), and provides its own models. These models will help implement PII redaction.</p>
<p>If you would like to learn more about natural language processing, machine learning, and Elastic, please be sure to check out these related articles:</p>
<ul>
<li><a href="https://www.elastic.co/blog/introduction-to-nlp-with-pytorch-models">Introduction to modern natural language processing with PyTorch in Elasticsearch</a></li>
<li><a href="https://www.elastic.co/blog/how-to-deploy-natural-language-processing-nlp-getting-started">How to deploy natural language processing (NLP): Getting started</a></li>
<li><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html">Elastic Redact Processor Documentation</a></li>
<li><a href="https://www.elastic.co/blog/may-2023-launch-sparse-encoder-ai-model">Introducing Elastic Learned Sparse Encoder: Elastic’s AI model for semantic search</a></li>
<li><a href="https://www.elastic.co/blog/may-2023-launch-machine-learning-models">Accessing machine learning models in Elastic</a></li>
</ul>
<p>In this blog, we will show you how to set up PII redaction through the use of Elasticsearch’s ability to load a trained model within machine learning and the flexibility of Elastic’s ingest pipelines.</p>
<p>Specifically, we’ll walk through setting up a <a href="https://www.elastic.co/blog/how-to-deploy-nlp-named-entity-recognition-ner-example">named entity recognition (NER)</a> model for person and location identification, as well as deploying the redact processor for custom data identification and removal. All of this will then be combined with an ingest pipeline where we can use Elastic machine learning and data transformations capabilities to remove sensitive information from your data.</p>
<h2>Loading the trained model</h2>
<p>Before we begin, we must load our NER model into our Elasticsearch cluster. This may be easily accomplished with Docker and the Elastic Eland client. From a command line, let’s install the Eland client via git:</p>
<pre><code class="language-bash">git clone https://github.com/elastic/eland.git
</code></pre>
<p>Navigate into the recently downloaded client:</p>
<pre><code class="language-bash">cd eland/
</code></pre>
<p>Now let’s build the client:</p>
<pre><code class="language-bash">docker build -t elastic/eland .
</code></pre>
<p>From here, you’re ready to deploy the trained model to an Elastic machine learning node! Be sure to replace the &lt;username&gt;, &lt;password&gt;, &lt;es-cluster-hostname&gt;, and &lt;esport&gt; placeholders with your own values.</p>
<p>If you’re using the Elastic Cloud or have signed certificates, simply run this command:</p>
<pre><code class="language-bash">docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://&lt;username&gt;:&lt;password&gt;@&lt;es-cluster-hostname&gt;:&lt;esport&gt;/ --hub-model-id dslim/bert-base-NER --task-type ner --start
</code></pre>
<p>If you’re using self-signed certificates, run this command:</p>
<pre><code class="language-bash">docker run -it --rm --network host elastic/eland eland_import_hub_model --url https://&lt;username&gt;:&lt;password&gt;@&lt;es-cluster-hostname&gt;:&lt;esport&gt;/ --insecure --hub-model-id dslim/bert-base-NER --task-type ner --start
</code></pre>
<p>From here you’ll witness the Eland client in action downloading the trained model from <a href="https://huggingface.co/dslim/bert-base-NER">HuggingFace</a> and automatically deploying it into your cluster!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-huggingface.png" alt="huggingface code" /></p>
<p>Synchronize your newly loaded trained model by clicking the blue “Synchronize your jobs and trained models” link in the Machine Learning Overview UI.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Machine-Learning-Overview-UI.png" alt="Machine Learning Overview UI" /></p>
<p>Now click the Synchronize button.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Synchronize-button.png" alt="Synchronize button" /></p>
<p>That’s it! Congratulations, you just loaded your first trained model into Elastic!</p>
<h2>Create the redact processor and ingest pipeline</h2>
<p>From DevTools, let’s configure the redact processor along with our inference processor to take advantage of Elastic’s trained model we just loaded. This will create an ingest pipeline named “redact” that we can then use to remove sensitive data from any field we wish. In this example, I’ll be focusing on the “message” field. Note: at the time of this writing, the redact processor is experimental and must be created via DevTools.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/redact
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redacted&quot;,
        &quot;value&quot;: &quot;{{{message}}}&quot;
      }
    },
    {
      &quot;inference&quot;: {
        &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
        &quot;field_map&quot;: {
          &quot;message&quot;: &quot;text_field&quot;
        }
      }
    },
    {
      &quot;script&quot;: {
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;source&quot;: &quot;String msg = ctx['message'];\r\n                for (item in ctx['ml']['inference']['entities']) {\r\n                msg = msg.replace(item['entity'], '&lt;' + item['class_name'] + '&gt;')\r\n                }\r\n                ctx['redacted']=msg&quot;
      }
    },
    {
      &quot;redact&quot;: {
        &quot;field&quot;: &quot;redacted&quot;,
        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL}&quot;,
          &quot;%{IP:IP_ADDRESS}&quot;,
          &quot;%{CREDIT_CARD:CREDIT_CARD}&quot;,
          &quot;%{SSN:SSN}&quot;,
          &quot;%{PHONE:PHONE}&quot;
        ],
        &quot;pattern_definitions&quot;: {
          &quot;CREDIT_CARD&quot;: &quot;\\d{4}[ -]\\d{4}[ -]\\d{4}[ -]\\d{4}&quot;,
          &quot;SSN&quot;: &quot;\\d{3}-\\d{2}-\\d{4}&quot;,
          &quot;PHONE&quot;: &quot;\\d{3}-\\d{3}-\\d{4}&quot;
        }
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_missing&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;pii_script-redact&quot;
      }
    }
  ]
}
</code></pre>
<p>OK, but what does each processor really do? Let’s walk through each processor in detail here:</p>
<ol>
<li>
<p>The SET processor creates the field “redacted,” which is copied over from the message field and used later on in the pipeline.</p>
</li>
<li>
<p>The INFERENCE processor calls the NER model we loaded to be used on the message field for identifying names, locations, and organizations.</p>
</li>
<li>
<p>The SCRIPT processor then replaces each entity detected in the message field with its class name and writes the result to the redacted field.</p>
</li>
<li>
<p>Our REDACT processor uses Grok patterns to identify any custom set of data we wish to remove from the redacted field (which was copied over from the message field).</p>
</li>
<li>
<p>The REMOVE processor deletes the extraneous ml.* fields from being indexed; note we’ll add “message” to this processor once we validate data is being redacted properly.</p>
</li>
<li>
<p>The ON_FAILURE / SET processor captures any errors just in case we have them.</p>
</li>
</ol>
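<p>To make the flow concrete, here is a small Python sketch of what the SCRIPT and REDACT steps do to a message. The entity list is hard-coded and hypothetical (in the pipeline it comes from the NER model), the redact function name is made up, and only the three custom patterns from pattern_definitions are reproduced; the built-in EMAILADDRESS and IP Grok patterns are left out.</p>
<pre><code class="language-python">import re

# The three custom definitions from the pipeline above.
PATTERN_DEFINITIONS = {
    'CREDIT_CARD': r'\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}',
    'SSN': r'\d{3}-\d{2}-\d{4}',
    'PHONE': r'\d{3}-\d{3}-\d{4}',
}


def redact(message, entities):
    """Replace NER entities first, then anything matching the custom patterns."""
    redacted = message
    # SCRIPT step: swap each detected entity for its class name.
    for item in entities:
        redacted = redacted.replace(item['entity'], '&lt;' + item['class_name'] + '&gt;')
    # REDACT step: swap each pattern match for the pattern's label.
    for label, pattern in PATTERN_DEFINITIONS.items():
        redacted = re.sub(pattern, '&lt;' + label + '&gt;', redacted)
    return redacted


# Hypothetical model output for illustration.
entities = [{'entity': 'John Smith', 'class_name': 'PER'}]
message = 'John Smith, phone 412-189-9043, SSN 942-00-1243, card 1324-8374-0978-2819'
result = redact(message, entities)
print(result)
</code></pre>
<p>The original message is never modified; everything is written to a separate value, which matches the pipeline's approach of keeping message intact until the redaction has been validated.</p>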
<h2>Slice your PII</h2>
<p>Now that your ingest pipeline with all the necessary steps has been configured, let’s start testing how well we can remove sensitive data from documents. Navigate over to Stack Management, select Ingest Pipelines and search for “redact”, and then click on the result.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-elastic-Ingest-Pipelines.png" alt="Ingest Pipelines" /></p>
<p>Click on the Manage button, and then click Edit.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-Manage-button.png" alt="Manage button" /></p>
<p>Here we are going to test our pipeline by adding some documents. Below is a sample you can copy and paste to make sure everything is working correctly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-test-pipeline.png" alt="test pipeline" /></p>
<pre><code class="language-yaml">[
  {
    &quot;_source&quot;: {
      &quot;message&quot;: &quot;John Smith lives at 123 Main St. Highland Park, CO. His email address is jsmith123@email.com and his phone number is 412-189-9043.  I found his social security number, it is 942-00-1243. Oh btw, his credit card is 1324-8374-0978-2819 and his gateway IP is 192.168.1.2&quot;
    }
  }
]
</code></pre>
<p>Simply press the Run the pipeline button, and you will then see the following output:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-pii-output-2.png" alt="pii output code" /></p>
<h2>What’s next?</h2>
<p>After you’ve added this ingest pipeline to a data set you’re indexing and validated that it is meeting expectations, you can add the message field to be removed so that no PII data is indexed. Simply update your REMOVE processor to include the message field and simulate again to only see the redacted field.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-manage-processor.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/elastic-blog-pii-output.png" alt="pii output code 2" /></p>
<h2>Conclusion</h2>
<p>With this step-by-step approach, you are now ready and able to detect and redact any sensitive data throughout your indices.</p>
<p>Here’s a quick recap of what we covered:</p>
<ul>
<li>Loading a pre-trained named entity recognition model into an Elastic cluster</li>
<li>Configuring the Redact processor, along with the inference processor, to use the trained model during data ingestion</li>
<li>Testing sample data and modifying the ingest pipeline to safely remove personally identifiable information</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your data.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/remove-pii-data/blog-post4-ai-search-B.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Gaining new perspectives beyond logging: An introduction to application performance monitoring]]></title>
            <link>https://www.elastic.co/observability-labs/blog/introduction-apm-tracing-logging</link>
            <guid isPermaLink="false">introduction-apm-tracing-logging</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Change is on the horizon for the world of logging. In this post, we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.]]></description>
            <content:encoded><![CDATA[<h2>Prioritize customer experience with APM and tracing</h2>
<p>Enterprise software development and operations has become an interesting space. We have some incredibly powerful tools at our disposal, yet as an industry, we have failed to adopt many of these tools that can make our lives easier. One such tool that is currently underutilized is <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">application performance monitoring</a> (APM) and tracing, despite the fact that OpenTelemetry has made it possible to adopt at low friction.</p>
<p>Logging, however, is ubiquitous. Every software application has logs of some kind, and the default workflow for troubleshooting (even today) is to go from exceptions experienced by customers and systems to the logs and start from there to find a solution.</p>
<p>There are various challenges with this, one of the main ones being that logs often do not give enough information to solve the problem. Many services today return ambiguous 500 errors with little or nothing to go on. What if there isn’t an error or log file at all or the problem is that the system is very slow? Logging alone cannot help solve these problems. This leaves users with half broken systems and poor user experiences. We’ve all been on the wrong side of this, and it can be incredibly frustrating.</p>
<p>The question I find myself asking is why does the customer experience often come second to errors? If the customer experience is a top priority, then a strategy should be in place to adopt tracing and APM and make this as important as logging. Users should stop going to logs by default and thinking primarily in logs, as many are doing today. This will also come with some required changes to mental models.</p>
<p>What’s the path to get there? That’s exactly what we will explore in this blog post. We will start by talking about supporting organizational changes, and then we’ll outline a recommended journey for moving from just logging to a fully integrated solution with logs, traces, and APM.</p>
<h2>Cultivating a new monitoring mindset: How to drive APM and tracing adoption</h2>
<p>To get teams to shift their troubleshooting mindset, what organizational changes need to be made?</p>
<p>Initially, businesses should consider strategic priorities and goals that need to be shared broadly among the teams. One thing that can help drive this in a very large organization is to consider an entire product team devoted to Observability or a CoE (Center of Excellence) with its own roadmap and priorities.</p>
<p>This team (either virtual or permanent) should start with the customer in mind and work backward, starting with key questions like: What do I need to collect? What do I need to observe? How do I act? Once team members understand the answers to these questions, they can start to think about the technology decisions needed to drive those outcomes.</p>
<p>From a tracing and APM perspective, the areas of greatest concern are the customer experience, service level objectives, and service level outcomes. From here, organizations can start to implement programs of work to continuously improve and share knowledge across teams. This will help to align teams around a common framework with shared goals.</p>
<p>In the next few sections, we will go through a four-step journey to help you maximize your success with APM and tracing. It takes you through the following key steps on the way to successful APM adoption:</p>
<ol>
<li><strong>Ingest:</strong> What choices do you have to make to get tracing activated and start ingesting trace data into your observability tools?</li>
<li><strong>Integrate:</strong> How does tracing integrate with logs to enable full end-to-end observability, and what else beyond simple tracing can you utilize to get even better resolution on your data?</li>
<li><strong>Analytics and AIOps:</strong> Improve the customer experience and reduce the noise through machine learning.</li>
<li><strong>Scale and total cost of ownership:</strong> Roll out enterprise-wide tracing and adopt strategies to deal with data volume.</li>
</ol>
<h2>1. Ingest</h2>
<p>Ingesting data for APM purposes generally involves “instrumenting” the application. In this section, we will explore methods for instrumenting applications, talk a little bit about sampling, and finally wrap up with a note on using common schemas for data representation.</p>
<h3>Getting started with instrumentation</h3>
<p>What options do we have for ingesting APM and trace data? There are many, and we will discuss them to help guide you, but first let's take a step back. APM has a deep history. In the very first implementations of APM, people were concerned mainly with timing methods, as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-timing-methods.png" alt="timing methods" /></p>
<p>Usually you had a configuration file to specify which methods you wanted to time, and the APM implementation would instrument the specified code with method timings.</p>
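<p>As a rough illustration of that idea, the Python sketch below wraps only the functions named in a small configuration set and records how long each call takes. The names METHODS_TO_TIME, timed, load_cart, and health_check are invented for this example, and a real APM agent would instrument code automatically rather than through an explicit decorator.</p>
<pre><code class="language-python">import functools
import time

# Stand-in for the configuration file listing which methods to time.
METHODS_TO_TIME = {'load_cart'}
timings = []  # collected (function name, seconds) pairs


def timed(func):
    """Wrap func with timing only if its name is listed in the configuration."""
    if func.__name__ not in METHODS_TO_TIME:
        return func

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the duration even when the call raises.
            timings.append((func.__name__, time.perf_counter() - start))

    return wrapper


@timed
def load_cart(user_id):
    time.sleep(0.01)  # simulate some work
    return {'user': user_id, 'items': 3}


@timed
def health_check():
    return 'ok'


load_cart('alice')
health_check()
print(timings)
</code></pre>
<p>Only load_cart produces a timing entry, because health_check is not listed in the configuration and is returned unwrapped.</p>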
<p>From here things started to evolve, and one of the first additions to APM was to add in tracing.</p>
<p>For Java, it’s fairly trivial to implement a system to do this by using what's known as a Java agent. You just specify the -javaagent command line argument, and the agent code gets access to the dynamic compilation routines within Java so it can modify the code before it is compiled into machine code, allowing you to “wrap” specific methods with timing or tracing routines. So, auto-instrumenting Java was one of the first things that the original APM vendors did.</p>
<p><a href="https://opentelemetry.io/docs/instrumentation/java/automatic/">OpenTelemetry has agents like this</a>, and most observability vendors that offer APM solutions have their own proprietary ways of doing this, often with more advanced and differing features from the open source tooling.</p>
<p>Things have moved on since then, and Node.js and Python are now popular.</p>
<p>As a result, ways of auto instrumenting these language runtimes have appeared, which mostly work by injecting the libraries into the code before starting them up. OpenTelemetry has a way of doing this on Kubernetes with an <a href="https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md">Operator and sidecar</a>, which supports Python, Node.js, Java, and .NET.</p>
<p>The other alternative is to start adding APM and tracing API calls into your own code, which is not dissimilar to adding logging functionality. You may even wish to create an abstraction in your code to deal with this cross-cutting concern, although this is less of a problem now that there are open standards with which you can implement this.</p>
<p>You can see an example of how to add OpenTelemetry spans and attributes to your code for manual instrumentation below and <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/monitor.py">here</a>.</p>
<pre><code class="language-python">from flask import Flask
import monitor  # Import the module
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import urllib
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.requests import RequestsInstrumentor


# Service name is required for most backends
resource = Resource(attributes={
    SERVICE_NAME: &quot;your-service-name&quot;
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTEL_EXPORTER_OTLP_ENDPOINT'),
        headers=&quot;Authorization=Bearer%20&quot;+os.getenv('OTEL_EXPORTER_OTLP_AUTH_HEADER')))

provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
RequestsInstrumentor().instrument()

# Initialize the Flask app; the route below is traced manually
app = Flask(__name__)

@app.route(&quot;/completion&quot;)
@tracer.start_as_current_span(&quot;do_work&quot;)
def completion():
    span = trace.get_current_span()
    if span:
        span.set_attribute(&quot;completion_count&quot;, 1)
    # A Flask view must return a response
    return &quot;ok&quot;
</code></pre>
<p>By implementing APM in this way, you could even eliminate the need to do any logging by storing all your required logging information within span attributes, exceptions, and metrics. The downside is that you can only do this with code that you own, so you will not be able to remove all logs this way.</p>
<h3>Sampling</h3>
<p>Many people don’t realize that APM is an expensive process. It adds a lot of CPU cycles and memory to your applications, and although there is a lot of value to be had, there are certainly trade-offs to be made.</p>
<p>Should you sample everything 100% and eat the cost? Or should you think about an intelligent trade-off with fewer samples or even tail-based sampling, which many products commonly support? Here, we will talk about the two most common sampling techniques — head-based sampling and tail-based sampling — to help you decide.</p>
<p><strong>Head-based sampling</strong><br />
In this approach, sampling decisions are made at the beginning of a trace, typically at the entry point of a service or application. A fixed rate of traces is sampled, and this decision propagates through all the services involved in a distributed trace.</p>
<p>With head-based sampling, you can control the rate using a configuration, allowing you to control the percentage of requests that are sampled and reported to the APM server. For instance, a sampling rate of 0.5 means that only 50% of requests are sampled and sent to the server. This is useful for reducing the amount of collected data while still maintaining a representative sample of your application's performance.</p>
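<p>As a rough sketch of how a head-based decision can be made, the snippet below derives the decision from the trace id alone, so every service that sees the same id reaches the same answer. This is an illustration only, not the algorithm used by any particular agent or SDK; <code>SAMPLE_RATE</code>, <code>new_trace_id</code>, and <code>should_sample</code> are hypothetical names.</p>

```python
import secrets

SAMPLE_RATE = 0.5   # mirrors the 0.5 example above
MAX_ID = 2 ** 64    # only the low 64 bits of the trace id are used

def new_trace_id():
    """Return a random 128-bit trace id as an integer."""
    return secrets.randbits(128)

def should_sample(trace_id, rate=SAMPLE_RATE):
    """Decide at the start of a trace, using only the trace id.

    The same trace id always produces the same decision, so the choice
    made at the entry point can be reproduced by downstream services.
    """
    # Sample when the low 64 bits fall below rate * 2**64.
    return int(rate * MAX_ID) > trace_id % MAX_ID

# Over many random ids, roughly SAMPLE_RATE of them are kept.
sampled = sum(should_sample(new_trace_id()) for _ in range(10_000))
```

<p>Because the decision is a pure function of the trace id, no coordination between services is needed beyond propagating the id itself.</p>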
<p><strong>Tail-based sampling</strong><br />
Unlike head-based sampling, tail-based sampling makes sampling decisions after the entire trace has been completed. This allows for more intelligent sampling decisions based on the actual trace data, such as only reporting traces with errors or traces that exceed a certain latency threshold.</p>
<p>We recommend tail-based sampling because it has the highest likelihood of reducing the noise and helping you focus on the most important issues. It also helps keep costs down on the data store side. A downside of tail-based sampling, however, is that it results in more data being generated from APM agents. This could use more CPU and memory on your application.</p>
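<p>For comparison, a tail-based decision can look at the whole trace once it has finished. The sketch below keeps traces that contain an error or exceed a latency threshold. The <code>Span</code> class, the 500 ms threshold, and the sample traces are assumptions made for illustration, not a real agent or server API.</p>

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False

# Hypothetical policy: keep traces with an error or slower than this threshold.
LATENCY_THRESHOLD_MS = 500.0

def keep_trace(spans, threshold_ms=LATENCY_THRESHOLD_MS):
    """Decide after the trace has completed, using all of its spans."""
    has_error = any(span.error for span in spans)
    # Treat the longest span (typically the root) as the trace latency.
    latency_ms = max(span.duration_ms for span in spans)
    return has_error or latency_ms > threshold_ms

fast_ok = [Span("GET /cart", 40.0), Span("SELECT cart", 12.0)]
slow_ok = [Span("GET /checkout", 900.0), Span("charge card", 850.0)]
fast_failed = [Span("GET /cart", 35.0), Span("SELECT cart", 5.0, error=True)]

decisions = [keep_trace(t) for t in (fast_ok, slow_ok, fast_failed)]
```

<p>Here only the slow trace and the failed trace are kept, which is the noise reduction described above. Note that every span still had to be produced and buffered before the decision could be made.</p>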
<h3>OpenTelemetry Semantic Conventions and Elastic Common Schema</h3>
<p>OpenTelemetry prescribes Semantic Conventions, or Semantic Attributes, to establish uniform names for various operations and data types. Adhering to these conventions fosters standardization across codebases, libraries, and platforms, ultimately streamlining the monitoring process.</p>
<p>Creating OpenTelemetry spans for tracing is flexible, allowing implementers to annotate them with operation-specific attributes. These spans represent particular operations within and between systems, often involving widely recognized protocols like HTTP or database calls. To effectively represent and analyze a span in monitoring systems, supplementary information is necessary, contingent upon the protocol and operation type.</p>
<p>Unifying attribution methods across different languages is essential for operators to easily correlate and cross-analyze telemetry from polyglot microservices without needing to grasp language-specific nuances.</p>
<p>Elastic's recent contribution of the Elastic Common Schema to OpenTelemetry enhances Semantic Conventions to encompass logs and security.</p>
<p>Abiding by a shared schema yields considerable benefits, enabling operators to rapidly identify intricate interactions and correlate logs, metrics, and traces, thereby expediting root cause analysis and reducing time spent searching for logs and pinpointing specific time frames.</p>
<p>We advocate for adhering to established schemas such as ECS when defining trace, metrics, and log data in your applications, particularly when developing new code. This practice will conserve time and effort when addressing issues.</p>
<h2>2. Integrate</h2>
<p>Integrations are very important for APM. How well your solution can integrate with other tools and technologies such as cloud, as well as its ability to integrate logs and metrics into your tracing data, is critical to fully understand the customer experience. In addition, most APM vendors have adjacent solutions for <a href="https://www.elastic.co/observability/synthetic-monitoring">synthetic monitoring</a> and profiling to gain deeper perspectives to supercharge your APM. We will explore these topics in the following section.</p>
<h3>APM + logs = superpowers!</h3>
<p>Because APM agents can instrument code, they can also instrument code that is being used for logging. This way, you can capture log lines directly within APM. <a href="https://www.elastic.co/guide/en/observability/master/logs-send-application.html">This is normally simple to enable</a>.</p>
<p>With this enabled, you will also get automated injection of useful fields like these:</p>
<ul>
<li>service.name, service.version, service.environment</li>
<li>trace.id, transaction.id, error.id</li>
</ul>
<p>This means log messages will be automatically correlated with transactions as shown below, making it far easier to reduce mean time to resolution (MTTR) and find the needle in the haystack:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-latency-distribution.png" alt="latency distribution" /></p>
<p>If this is available to you, we highly recommend turning it on.</p>
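<p>To show what this kind of field injection amounts to, here is a minimal sketch using Python's standard <code>logging</code> module. It is not how any specific APM agent implements it: <code>TraceContextFilter</code> and <code>current_trace_id</code> are hypothetical, and the trace id is hard-coded where an agent would read it from the active transaction.</p>

```python
import io
import logging

class TraceContextFilter(logging.Filter):
    """Attach service and trace fields to every log record."""

    def __init__(self, service_name, get_trace_id):
        super().__init__()
        self.service_name = service_name
        self.get_trace_id = get_trace_id  # callable returning the active trace id

    def filter(self, record):
        record.service_name = self.service_name
        record.trace_id = self.get_trace_id()
        return True  # never drop the record, only enrich it

# Hypothetical stand-in for asking the tracer for the current trace id.
def current_trace_id():
    return "4bf92f3577b34da6a3ce929d0e0e4736"

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s service.name=%(service_name)s trace.id=%(trace_id)s %(message)s"
))

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.addFilter(TraceContextFilter("checkout-service", current_trace_id))
logger.propagate = False

logger.info("payment declined")
log_line = stream.getvalue().strip()
```

<p>Every line written through this logger now carries the same <code>trace.id</code> as the surrounding transaction, which is what lets a backend join logs to traces.</p>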
<h3>Deploying APM inside Kubernetes</h3>
<p>It is common for people to want to deploy APM inside a Kubernetes environment, and tracing is critical for monitoring applications in cloud-native environments. There are three different ways you can tackle this.</p>
<p><strong>1. Auto instrumentation using sidecars</strong><br />
With Kubernetes, it is possible to use an init container and something that will modify Kubernetes manifests on the fly to auto instrument your applications.</p>
<p>The init container simply makes the required library or JAR file available to the main container in the Kubernetes pod at startup. Then, you can use <a href="https://kustomize.io/">Kustomize</a> to add the required command line arguments to bootstrap your agents.</p>
<p>If you are not familiar with it, Kustomize adds, removes, or modifies Kubernetes manifests on the fly. It is even built into the Kubernetes CLI: simply run <code>kubectl apply -k</code> against a directory that contains a kustomization file.</p>
<p>OpenTelemetry has an <a href="https://github.com/open-telemetry/opentelemetry-operator/blob/main/README.md">operator</a> that does all this for you automatically (without the need for Kustomize) for Java, DotNet, Python, and Node.JS, and many vendors also have their own operator or <a href="https://www.elastic.co/guide/en/apm/attacher/current/apm-attacher.html">helm charts</a> that can achieve the same result.</p>
<p><strong>2. Baking APM into containers or code</strong><br />
A second option for deploying APM in Kubernetes, and indeed in any containerized environment, is to bake the APM agents and configuration into the container image with a Dockerfile.</p>
<p>Have a look at an example here using the OpenTelemetry Java Agent:</p>
<pre><code class="language-dockerfile"># Use the official OpenJDK image as the base image
FROM openjdk:11-jre-slim

# Set up environment variables
ENV APP_HOME /app
ENV OTEL_VERSION 1.7.0-alpha
ENV OTEL_JAVAAGENT_URL https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v${OTEL_VERSION}/opentelemetry-javaagent-${OTEL_VERSION}-all.jar

# Create the application directory
RUN mkdir $APP_HOME
WORKDIR $APP_HOME

# Download the OpenTelemetry Java agent
ADD ${OTEL_JAVAAGENT_URL} /otel-javaagent.jar

# Add your Java application JAR file
COPY your-java-app.jar $APP_HOME/your-java-app.jar

# Expose the application port (e.g. 8080)
EXPOSE 8080

# Configure the OpenTelemetry Java agent and run the application
CMD java -javaagent:/otel-javaagent.jar \
      -Dotel.resource.attributes=service.name=your-service-name \
      -Dotel.exporter.otlp.endpoint=your-otlp-endpoint:4317 \
      -Dotel.exporter.otlp.insecure=true \
      -jar your-java-app.jar
</code></pre>
<p><strong>3. Tracing using a service mesh (Envoy/Istio)</strong><br />
The final option you have here is if you are using a service mesh. A service mesh is a dedicated infrastructure layer for handling service-to-service communication in a microservices architecture. It provides a transparent, scalable, and efficient way to manage and control the communication between services, enabling developers to focus on building application features without worrying about inter-service communication complexities.</p>
<p>The great thing about this is that we can activate tracing within the proxy and therefore get visibility into requests between services. We don’t have to change any code or even run APM agents for this; we simply turn on the OpenTelemetry collector that exists within the proxy — therefore this is likely the lowest overhead solution. <a href="https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/opentelemetry">Learn more about this option</a>.</p>
<h3>Synthetics and Universal Profiling</h3>
<p>Most APM vendors have add-ons to the primary APM use cases. Typically we see synthetics and <a href="https://www.elastic.co/observability/universal-profiling">continuous profiling</a> being added to APM solutions. APM can integrate with both, and there is some good value in bringing these technologies together to give even more insights into issues.</p>
<p><strong>Synthetics</strong><br />
Synthetic monitoring is a method used to measure the performance, availability, and reliability of web applications, websites, and APIs by simulating user interactions and traffic. It involves creating scripts or automated tests that mimic real user behavior, such as navigating through pages, filling out forms, or clicking buttons, and then running these tests periodically from different locations and devices.</p>
<p>This gives Development and Operations teams the ability to spot problems far earlier than they might otherwise, catching issues before real users do in many cases.</p>
<p>Synthetics can be integrated with APM — inject an APM agent into the website when the script runs, so even if you didn’t put end user monitoring into your website initially, it can be injected at run time. This usually happens without any input from the user. From there, a tracing id for each request can be passed down through the various layers of the system, allowing teams to follow the request all the way from the synthetics script to the lowest levels of the application stack such as the database.</p>
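<p>One common way to pass a tracing id between layers is an HTTP header. The sketch below assumes the W3C Trace Context <code>traceparent</code> header format, which the text above does not name, so treat that choice as an assumption. The function names are hypothetical, and a real agent would handle this propagation for you.</p>

```python
import secrets

def new_traceparent():
    """Create a traceparent header value for a brand-new trace."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # trailing 01 marks the trace as sampled

def child_traceparent(parent_header):
    """Keep the trace id, but mint a new span id for the next hop."""
    version, trace_id, _parent_span_id, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

# The synthetic test starts the trace; each downstream layer continues it.
browser_header = new_traceparent()
api_header = child_traceparent(browser_header)
db_header = child_traceparent(api_header)
```

<p>All three headers share the same trace id, so a request can be followed from the synthetics script down to the database call.</p>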
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-rainbow-sandals.png" alt="observability rainbow sandals" /></p>
<p><strong>Universal profiling</strong><br />
“Profiling” is a dynamic method of analyzing the complexity of a program, such as CPU utilization or the frequency and duration of function calls. With profiling, you can locate exactly which parts of your application are consuming the most resources. <a href="https://www.elastic.co/observability/universal-profiling">“Continuous profiling”</a> is a more powerful version of profiling that adds the dimension of time. By understanding your system’s resources over time, you can then locate, debug, and fix issues related to performance.</p>
<p>Universal profiling is a further extension of this, which allows you to capture profile information about all of the code running in your system all the time. Using a technology like <a href="https://www.elastic.co/blog/ebpf-observability-security-workload-profiling">eBPF</a> can allow you to see <em>all</em> the function calls in your systems, including into things like the Kubernetes runtime. Doing this gives you the ability to finally see unknown unknowns — things you didn’t know were problems. This is very different from APM, which is really about tracking individual traces and requests and the overall customer experience. Universal profiling is about overcoming those issues you didn’t even know existed and even answering the question “What is my most expensive line of code?”</p>
<p>Universal profiling can be linked into APM, showing you profiles that occurred during a specific customer issue, for example, or by linking profiles directly to traces by looking at the global state that exists at the thread level. These technologies can work wonders when used together.</p>
<p>Typically, profiles are viewed as “flame graphs” like the one shown below. The width of each box represents the amount of “on-cpu” time spent executing a particular function.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-universal-profiling.png" alt="observability universal profiling" /></p>
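<p>To give a feel for where the boxes in a flame graph come from, here is a minimal sketch that aggregates stack samples into per-path counts. The sample data and the <code>frame_widths</code> helper are made up for illustration and do not reflect how any particular profiler stores its data.</p>

```python
from collections import Counter

# Hypothetical stack samples, each listed from the outermost frame to the innermost.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "query_db"),
    ("main", "gc"),
]

def frame_widths(samples):
    """Count how many samples each call path appears in.

    A flame graph draws one box per call path, with a width
    proportional to this count.
    """
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += 1
    return widths

widths = frame_widths(samples)
```

<p>In this toy data set, <code>main</code> spans all four samples, <code>handle_request</code> three of them, and <code>query_db</code> two, so <code>query_db</code> would be drawn twice as wide as <code>parse_json</code>.</p>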
<h2>3. Analytics and AIOps</h2>
<p>The interesting thing about APM is it opens up a whole new world of analytics versus just logs. All of a sudden, you have access to the information flows from <em>inside</em> applications.</p>
<p>This allows you to easily capture things like the amount of money a specific customer is currently spending on your most critical ecommerce store, or look at failed trades in a brokerage app to see how much revenue those failures are costing. You can even then apply machine learning algorithms to project future spend or look at anomalies occurring in this data, giving you a new window into how your business runs.</p>
<p>In this section, we will look at ways to do this and how to get the most out of this new world, as well as how to apply AIOps practices to this new data. We will also discuss getting SLIs and SLOs set up for APM data.</p>
<h3>Getting business data into your traces</h3>
<p>There are generally two ways of getting business data into your traces. You can modify code and add in Span attributes, an example of which is available <a href="https://github.com/davidgeorgehope/ChatGPTMonitoringWithOtel/blob/main/monitor.py">here</a> and shown below. Or you can write an extension or a plugin, which has the benefit of avoiding code changes. OpenTelemetry supports <a href="https://opentelemetry.io/docs/instrumentation/java/extensions/">adding extensions in its auto-instrumentation agents</a>. Most other APM vendors usually have something similar.</p>
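<pre><code class="language-python">import json
from functools import wraps

from opentelemetry import trace

# `counters` and `calculate_cost` are assumed to be defined elsewhere,
# as in the linked monitor.py.
def count_completion_requests_and_tokens(func):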
    @wraps(func)
    def wrapper(*args, **kwargs):
        counters['completion_count'] += 1
        response = func(*args, **kwargs)

        token_count = response.usage.total_tokens
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        cost = calculate_cost(response)
        strResponse = json.dumps(response)

        # Set OpenTelemetry attributes
        span = trace.get_current_span()
        if span:
            span.set_attribute(&quot;completion_count&quot;, counters['completion_count'])
            span.set_attribute(&quot;token_count&quot;, token_count)
            span.set_attribute(&quot;prompt_tokens&quot;, prompt_tokens)
            span.set_attribute(&quot;completion_tokens&quot;, completion_tokens)
            span.set_attribute(&quot;model&quot;, response.model)
            span.set_attribute(&quot;cost&quot;, cost)
            span.set_attribute(&quot;response&quot;, strResponse)
        return response
    return wrapper
</code></pre>
<h3>Using business data for fun and profit</h3>
<p>Once you have the business data in your traces, you can start to have some fun with it. Take a look at the example below for a financial services fraud team. Here we are tracking transactions — average transaction value for our larger business customers. Crucially, we can see if there are any unusual transactions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-customer-count.png" alt="customer count" /></p>
<p>A lot of this is powered by machine learning, which can classify transactions or do <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">anomaly detection</a>. Once you start capturing the data, it is possible to do a lot of useful things like this, and with a flexible platform, integrating machine learning models into this process becomes a breeze.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-fraud-12h.png" alt="fraud 12-h" /></p>
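<p>As a very simplified stand-in for the anomaly detection mentioned above, the sketch below flags transaction amounts that sit far from the mean using a z-score. This is a plain statistical rule, not the machine learning used by any product, and the amounts and the threshold of 2.0 are invented for illustration.</p>

```python
from statistics import mean, stdev

def unusual_transactions(amounts, threshold=2.0):
    """Return the amounts whose z-score magnitude exceeds the threshold."""
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        return []  # all amounts identical, nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

# Hypothetical transaction values taken from span attributes.
amounts = [120.0, 95.0, 110.0, 105.0, 98.0, 102.0, 115.0, 5000.0]
flagged = unusual_transactions(amounts)
```

<p>With this data only the 5000.0 transaction is flagged. Real anomaly detection models also account for seasonality and trends, which a single global mean cannot capture.</p>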
<h3>SLIs and SLOs</h3>
<p>Service level indicators (SLIs) and service level objectives (SLOs) serve as critical components for maintaining and enhancing application performance. SLIs, which represent key performance metrics such as latency, error rate, and throughput, help quantify an application's performance, while SLOs establish target performance levels to meet user expectations.</p>
<p>By selecting relevant SLIs and setting achievable SLOs, organizations can better monitor their application's performance using APM tools. Continually evaluating and adjusting SLIs and SLOs in response to changes in application requirements, user expectations, or the competitive landscape ensures that the application remains competitive and delivers an exceptional user experience.</p>
<p>In order to define and track SLIs and SLOs, APM becomes a critical perspective that is needed for understanding the user experience. Once APM is implemented, we recommend that organizations perform the following steps.</p>
<ul>
<li>Define SLOs and SLIs required to track them.</li>
<li>Define SLO budgets and how they are calculated. Reflect the business’s perspective and set realistic targets.</li>
<li>Define SLIs to be measured from a user experience perspective.</li>
<li>Define different alerting and paging rules: page only on customer-facing SLO degradations, record symptomatic alerts, and notify on critical symptomatic alerts.</li>
</ul>
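<p>As an example of how an SLO budget might be calculated, the sketch below computes the share of an error budget that is still unspent for a request-based availability SLO. The function name and the numbers are hypothetical, and other budget definitions (for example, time-based ones) are equally valid.</p>

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    # With a 99.9% target, 0.1% of requests are allowed to fail.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures

remaining = error_budget_remaining(
    slo_target=0.999,
    total_requests=1_000_000,
    failed_requests=250,
)
```

<p>A 99.9% target over one million requests allows about 1,000 failures, so 250 failures leave roughly three quarters of the budget. A paging rule could then fire only when the budget is being spent faster than expected.</p>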
<p>Synthetic monitoring and end user monitoring (EUM) can also provide more of the data required to understand latency, throughput, and error rate from the user’s perspective, which is where good business-focused metrics and data need to come from.</p>
<h2>4. Scale and total cost of ownership</h2>
<p>With increased perspectives, customers often run into scalability and total cost of ownership issues. All this new data can be overwhelming. Luckily there are various techniques you can use to deal with this. Tracing itself can actually help with volume challenges because you can decompose unstructured logs and combine them with traces, which leads to additional efficiency. You can also use different sampling methods to deal with scale challenges (i.e., both techniques we previously mentioned).</p>
<p>In addition to this, for large enterprise scale, we can use streaming pipelines like Kafka or Pulsar to manage the data volumes. This has an additional benefit that you get for free: if you take down the systems consuming the data or they face outages, it is less likely you will lose data.</p>
<p>With this configuration in place, your “Observability pipeline” architecture would look like this:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/blog-elastic-opentelemetry-collector.png" alt="opentelemetry collector" /></p>
<p>This completely decouples your sources of data from your chosen observability solution, which will future proof your observability stack going forward, enable you to reach massive scale, and make you less reliant on specific vendor code for collection of data.</p>
<p>Another thing we recommend is being intelligent about instrumentation. This has two benefits: you will get some CPU cycles back in the instrumented application, and your backend data collection systems will have less data to process. If you know, for example, that you have no interest in tracking calls to a specific endpoint, you can exclude the classes and methods that serve it from instrumentation.</p>
<p>And finally, data tiering is a transformative approach for managing data storage that can significantly reduce the total cost of ownership (TCO) for businesses. Primarily, it allows organizations to store data across different types of storage mediums based on their accessibility needs and the value of the data. For instance, frequently accessed, high-value data can be stored in expensive, high-speed storage, while less frequently accessed, lower-value data can be stored in cheaper, slower storage.</p>
<p>This approach, often incorporated in cloud storage solutions, enables cost optimization by ensuring that businesses only pay for the storage they need at any given time. Furthermore, it provides the flexibility to scale up or down based on demand, eliminating the need for large capital expenditures on storage infrastructure. This scalability also reduces the need for costly over-provisioning to handle potential future demand.</p>
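<p>A tiering policy can be as simple as a lookup by data age. The sketch below uses age as a proxy for how often data is accessed. The tier names and age limits are assumptions for illustration only; real values depend on your retention and query patterns.</p>

```python
from datetime import timedelta

# Hypothetical tiers and age limits, ordered from fastest to slowest storage.
TIERS = [
    ("hot", timedelta(days=7)),    # fast, expensive storage for recent data
    ("warm", timedelta(days=30)),  # slower, cheaper storage
]
DEFAULT_TIER = "cold"              # cheapest storage for everything older

def tier_for(age):
    """Pick the first tier whose age limit the data has not yet passed."""
    for name, max_age in TIERS:
        if max_age >= age:
            return name
    return DEFAULT_TIER

placements = [tier_for(timedelta(days=d)) for d in (1, 10, 90)]
```

<p>Here day-old data lands in the hot tier, ten-day-old data in the warm tier, and ninety-day-old data in the cold tier.</p>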
<h2>Conclusion</h2>
<p>In today's highly competitive and fast-paced software development landscape, simply relying on logging is no longer sufficient to ensure top-notch customer experiences. By adopting APM and distributed tracing, organizations can gain deeper insights into their systems, proactively detect and resolve issues, and maintain a robust user experience.</p>
<p>In this blog, we have explored the journey of moving from a logging-only approach to a comprehensive observability strategy that integrates logs, traces, and APM. We discussed the importance of cultivating a new monitoring mindset that prioritizes customer experience, and the necessary organizational changes required to drive APM and tracing adoption. We also delved into the various stages of the journey, including data ingestion, integration, analytics, and scaling.</p>
<p>By understanding and implementing these concepts, organizations can optimize their monitoring efforts, reduce MTTR, and keep their customers satisfied. Ultimately, prioritizing customer experience through APM and tracing can lead to a more successful and resilient enterprise in today's challenging environment.</p>
<p><a href="https://www.elastic.co/observability/application-performance-monitoring">Learn more about APM at Elastic</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/introduction-apm-tracing-logging/log-management-720x420_(2).jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Dynamic workload discovery on Kubernetes now supported with EDOT Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector</link>
            <guid isPermaLink="false">k8s-discovery-with-EDOT-collector</guid>
            <pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic's OpenTelemetry Collector leverages Kubernetes pod annotations providing dynamic workload discovery and improves automated metric and log collection for Kubernetes clusters.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, Kubernetes is one of the most significant observability use cases we focus on.
We want to provide the best onboarding experience and lifecycle management based on real-world GitOps best practices.</p>
<p>OpenTelemetry recently <a href="https://opentelemetry.io/blog/2025/otel-collector-k8s-discovery/">published a blog</a> on how to do <code>Autodiscovery based on Kubernetes Pods' annotations</code> with the OpenTelemetry Collector.</p>
<p>In this blog post, we will talk about how to use this Kubernetes-related feature of the OpenTelemetry Collector,
which is already available with the Elastic Distribution of OpenTelemetry (EDOT) Collector.</p>
<p>In addition to this feature, at Elastic, we heavily invest in making OpenTelemetry the best, standardized ingest solution for Observability.
You might already have seen us focusing on:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Semantic Conventions standardization</a></p>
</li>
<li>
<p>significant <a href="https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver">log collection improvements</a></p>
</li>
<li>
<p>various other topics around <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">instrumentation</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling</a></p>
</li>
</ul>
<p>Let's walk you through a hands-on journey using the EDOT Collector covering various use cases you might encounter in the real world, highlighting the capabilities of this powerful feature.</p>
<h2>Configuring EDOT Collector</h2>
<p>The Collector’s configuration is not our main focus here, since based on the nature of this feature it is minimal,
letting workloads define how they should be monitored.</p>
<p>To illustrate the point, here is the Collector configuration snippet that enables the feature for both logs and metrics:</p>
<pre><code class="language-yaml">receivers:
    receiver_creator/metrics:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

    receiver_creator/logs:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:
</code></pre>
<p>You can include the above in the EDOT Collector’s configuration, specifically the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L339">receivers’ section</a>.</p>
<p>Since log collection in our examples will happen through the discovery feature, make sure that the static filelog receiver
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L348">configuration block</a> is removed
and its <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L193"><code>preset</code></a>
is disabled (i.e. set to <code>false</code>) to avoid having log duplication.</p>
<p>Make sure that the receiver creator is properly added in the pipelines for
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L471">logs</a>
(in addition to removing the <code>filelog</code> receiver completely)
and <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L484">metrics</a>
respectively.</p>
<p>Ensure that <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.122.0/extension/observer/k8sobserver/README.md"><code>k8sobserver</code></a>
is enabled as part of the extensions:</p>
<pre><code class="language-yaml">extensions:
  k8s_observer:
    observe_nodes: true
    observe_services: true
    observe_ingresses: true

# ...

service:
  extensions: [k8s_observer]
</code></pre>
<p>Last but not least, ensure the log files' volume is mounted properly:</p>
<pre><code class="language-yaml">volumeMounts:
  - name: varlogpods
    mountPath: /var/log/pods
    readOnly: true

volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
</code></pre>
<p>Once the configuration is ready, follow the <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">Kubernetes quickstart guides on how to deploy the EDOT Collector</a>.
Make sure to replace the <code>values.yaml</code> file linked in the quickstart guide with the file that includes the above-described modifications.</p>
<h3>Collecting Metrics from Moving Targets Based on Their Annotations</h3>
<p>In this example, we have a Deployment with a Pod spec that consists of two different containers.
One container runs a Redis server, while the other runs an NGINX server. Consequently, we want to provide
different hints for each of these target containers.</p>
<p>The annotation-based discovery feature supports this, allowing us to specify metrics annotations
per exposed container port.</p>
<p>Here is how the complete spec file looks:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /dev/stderr warn;
    pid        /var/run/nginx.pid;
    events {
      worker_connections  1024;
    }
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;

      log_format  main  '$remote_addr - $remote_user [$time_local] &quot;$request&quot; '
                        '$status $body_bytes_sent &quot;$http_referer&quot; '
                        '&quot;$http_user_agent&quot; &quot;$http_x_forwarded_for&quot;';
      access_log  /dev/stdout main;
      server {
          listen 80;
          server_name localhost;

          location /nginx_status {
              stub_status on;
          }
      }
      include /etc/nginx/conf.d/*;
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        # redis container port hints
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        # nginx container port hints
        io.opentelemetry.discovery.metrics.80/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.80/scraper: nginx
        io.opentelemetry.discovery.metrics.80/config: |
          endpoint: &quot;http://`endpoint`/nginx_status&quot;
          collection_interval: &quot;30s&quot;
          timeout: &quot;20s&quot;
    spec:
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
            - key: nginx.conf
              path: nginx.conf
      containers:
        - name: webserver
          image: nginx:latest
          ports:
            - containerPort: 80
              name: webserver
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              readOnly: true
              subPath: nginx.conf
              name: nginx-conf
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
</code></pre>
<p>When this workload is deployed, the Collector automatically discovers it and reads its annotations.
It then starts two different receivers, one for each of the target containers.</p>
<h3>Collecting Logs from Multiple Target Containers</h3>
<p>The annotation-based discovery feature also supports log collection based on the provided annotations.
In the example below, we again have a Deployment with a Pod consisting of two different containers,
where we want to apply different log collection configurations.
We can specify annotations that are scoped to individual container names:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-logs-deployment
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        io.opentelemetry.discovery.logs.lazybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.lazybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-lazybox
        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from busybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 5s; done
        - name: lazybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from lazybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 25s; done
</code></pre>
<p>The above configuration enables two different filelog receiver instances, each applying its own parsing configuration.
This is handy when we know how to parse the logs of a specific technology, such as Apache server access logs.</p>
<h3>Combining Both Metrics and Logs Collection</h3>
<p>In our third example, we illustrate how to define both metrics and log annotations on the same workload.
This allows us to collect both signals from the discovered workload.
Below is a Deployment with a Pod consisting of a Redis server and a BusyBox container that performs dummy log writing.
We can target annotations to the port and container levels to collect metrics from the Redis server using
the Redis receiver, and logs from the BusyBox using the filelog receiver. Here’s how:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints
    spec:
      containers:
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 15s; done
</code></pre>
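<p>To make the annotation naming scheme used in these examples easier to see, here is a minimal Python sketch that groups hints by signal and scope. It only illustrates the key layout <code>io.opentelemetry.discovery.&lt;signal&gt;.&lt;scope&gt;/&lt;setting&gt;</code> as it appears in the manifests above; the function name <code>group_hints</code> is hypothetical and this is not how the Collector itself is implemented.</p>

```python
from collections import defaultdict

PREFIX = "io.opentelemetry.discovery."


def group_hints(annotations):
    """Group discovery annotations by (signal, scope).

    Keys look like io.opentelemetry.discovery.<signal>.<scope>/<setting>,
    where <scope> is a container port (metrics) or a container name (logs).
    """
    hints = defaultdict(dict)
    for key, value in annotations.items():
        if not key.startswith(PREFIX) or "/" not in key:
            continue  # not a discovery hint
        path, setting = key[len(PREFIX):].split("/", 1)
        signal, _, scope = path.partition(".")
        if not scope:
            continue  # no port or container scope in the key
        hints[(signal, scope)][setting] = value
    return dict(hints)


# Annotations taken from the combined Redis and BusyBox example above,
# plus one unrelated annotation that must be ignored.
annotations = {
    "io.opentelemetry.discovery.metrics.6379/enabled": "true",
    "io.opentelemetry.discovery.metrics.6379/scraper": "redis",
    "io.opentelemetry.discovery.logs.busybox/enabled": "true",
    "app.kubernetes.io/name": "redis",
}

hints = group_hints(annotations)
for (signal, scope), settings in sorted(hints.items()):
    print(signal, scope, settings)
```

<p>Each resulting group corresponds to one receiver instance in the examples: one metrics receiver per annotated port and one filelog receiver per annotated container.</p>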
<h3>Explore and analyse data coming from dynamic targets in Elastic</h3>
<p>Once the target Pods are discovered and the Collector has started collecting telemetry data from them,
we can then explore this data in Elastic. In Discover we can search for Redis and NGINX metrics as well as
logs collected from the BusyBox container. Here is what it looks like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discoverlogs.png" alt="Logs Discovery" />
<img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discovermetrics.png" alt="Metrics Discovery" /></p>
<h2>Summary</h2>
<p>The examples above showcase how users of our OpenTelemetry Collector can take advantage of this new feature
— one we played a major role in developing.</p>
<p>For this, we leveraged our years of experience with similar features already supported in
<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html">Metricbeat</a>,
<a href="https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html">Filebeat</a>, and
<a href="https://www.elastic.co/guide/en/fleet/current/hints-annotations-autodiscovery.html">Elastic-Agent</a>.
This makes us extremely happy and confident, as it closes the feature gap between Elastic's specific
monitoring agents and the OpenTelemetry Collector — making it even better.</p>
<p>Interested in learning more? Visit the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/receivercreator/README.md#generate-receiver-configurations-from-provided-hints">documentation</a>
and give it a try by following our <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">EDOT quickstart guide</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/k8s-discovery-new.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Troubleshooting Kafka-Logstash-Elasticsearch Performance Issues in delay-sensitive platforms]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kafka-logstash-elasticsearch-performance-issues</link>
            <guid isPermaLink="false">kafka-logstash-elasticsearch-performance-issues</guid>
            <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to troubleshoot ingestion bottlenecks in data pipelines built with Kafka, Logstash and Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>Kafka is an open-source, distributed event streaming and queuing platform widely used with Elastic to build high-throughput, large-scale data pipelines, facilitate seamless data integration, and support mission-critical applications. System designs with Kafka significantly enable the decoupling of components within the data pipeline ensuring scalability and a robust design for failure by managing downstream back-pressure during traffic surges, maintenance activities, or any other periods of performance degradation. </p>
<p>In addition to its queuing capabilities, Kafka can serve as a central processing middleware for data pre-processing and enrichment. This is particularly useful when such operations are impractical to perform directly downstream due to specific business or technical requirements or constraints.</p>
<p>For instance, integrating Kafka with stream processing engines like <a href="https://ksqldb.io/">ksqlDB</a> or <a href="https://materialize.com/">Materialize</a> allows for advanced stream processing tasks, including SQL-based joins across topics and streams to enrich data at scale in real time. The enriched datasets can then be ingested into Elasticsearch for further processing at subsequent stages.</p>
<p>Despite these benefits, whether to adopt Kafka or a similar queuing system depends on the use case. These systems introduce additional costs and complexity to the overall platform implementation and maintenance. They may also add processing overhead, delay data flow downstream, and risk becoming bottlenecks if not correctly sized or optimized to align with other pipeline components.</p>
<p>This article provides guidance for troubleshooting ingestion bottlenecks in data pipelines built with Kafka and Elastic. Identifying and fixing such issues can sometimes be challenging, particularly when several aspects of several systems are changed at the same time, which increases the number of variables in play. This commonly results in a longer process and inconsistent results.</p>
<p>Consider the Security Operations Center (SOC) platform below, where data is ingested from various sources via Elastic Agent. The data is queued and pre-processed in a Kafka cluster before being pulled by Logstash and forwarded to Elastic Security. In this environment, delays at any stage of the pipeline can result in critical security events going undetected by Elastic Security, which makes a well-optimized data pipeline essential.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image5.png" alt="" /></p>
<h2>Implement lag and throughput monitoring</h2>
<p>Ingestion bottlenecks usually materialize as limited throughput and event lag, which often correlate. Monitoring these two indicators is important for measuring the impact of tuning attempts.</p>
<p><strong><em>Tip</em></strong><em>: With the anomaly detection features of machine learning you can use the</em> <a href="https://www.elastic.co/guide/en/observability/current/inspect-log-anomalies.html"><em>Logs Anomalies page</em></a> <em>to detect and inspect log anomalies and the log partitions where the log anomalies occur.</em></p>
<p>End-to-end lag monitoring can be broken down into the various stages of the pipeline. The incremental improvements across those stages would collectively contribute to a significant reduction in the end-to-end lag:</p>
<p><strong>A) Ingest lag between the source and Kafka:</strong> This lag is the time difference between the real event-time, which is typically extracted from the event itself or added by the event producer (Elastic Agent for example), and the Kafka record timestamp, which can be added to the Logstash events via event <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-decorate_events">decoration</a> in the Kafka input plugin. </p>
<p>In most cases, this lag is influenced by the write performance of the Kafka cluster and network latency between the event source and Kafka. In some cases, the lag may also appear due to time configuration mismatches that make it look like there's a lag when there really isn't.</p>
<p><strong>B) Ingest lag between Kafka and Logstash:</strong> This lag is the time difference between the Kafka record's timestamp and the execution timestamp of the first filter in the Logstash pipeline. If your pipelines are using a persistent queue, note that this duration also includes the time spent in the PQ.</p>
<p>The Ruby filter below adds the current time to the event in the <code>logstash.start</code> field for later comparison.</p>
<pre><code>ruby {
 code =&gt; &quot;event.set('[logstash][start]', Time.now());&quot;
}
</code></pre>
<p>The primary factors contributing to this lag include the consumption performance of the Kafka cluster, the Logstash input performance, data skew across the topic partitions, and, most importantly, back-pressure propagating to the Logstash input plugin: while Logstash is busy processing the events it has already fetched, it does not fetch new events from the Kafka topic as quickly as they become available.</p>
<p>Network latency and a small TCP read buffer (<a href="https://man7.org/linux/man-pages/man7/tcp.7.html">SO_RCVBUF</a>) on the Logstash host can also prevent Logstash from fetching data from Kafka at the required rate.</p>
<p>Consumer lag serves as an effective indicator of this issue and can be viewed on Kafka's consumer group metrics. It is calculated as the difference between the log-end offset (the offset of the most recently produced message) and the current offset (the last committed offset by the consumer) for each partition.</p>
<pre><code>$KAFKA_HOME/bin/kafka-consumer-groups.sh  --bootstrap-server &lt;server:port&gt; --describe --group &lt;group_id&gt;
</code></pre>
<pre><code>GROUP                 TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET   LAG
logstash-cg-soc-1     windows-events    0          4498            17309            12811
logstash-cg-soc-1     windows-events    1          4470            17213            12743
...
</code></pre>
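<p>The <code>LAG</code> column can be reproduced from the two offset columns, which is useful when exporting these numbers to a dashboard. The following Python sketch parses output shaped like the trimmed sample above. It assumes exactly the six columns shown here (the real tool prints additional columns such as the consumer ID and host), and the function name <code>parse_consumer_lag</code> is hypothetical.</p>

```python
def parse_consumer_lag(describe_output):
    """Return {(topic, partition): lag} from kafka-consumer-groups text output.

    Lag is computed as LOG-END-OFFSET minus CURRENT-OFFSET for each partition.
    """
    lags = {}
    for line in describe_output.splitlines():
        fields = line.split()
        # Data rows have six columns with a numeric partition and numeric offsets.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[2:5]):
            continue  # header, blank line, or partition without a committed offset
        _group, topic, partition, current, log_end, _lag = fields
        lags[(topic, int(partition))] = int(log_end) - int(current)
    return lags


sample = """\
GROUP                 TOPIC             PARTITION  CURRENT-OFFSET  LOG-END-OFFSET   LAG
logstash-cg-soc-1     windows-events    0          4498            17309            12811
logstash-cg-soc-1     windows-events    1          4470            17213            12743
"""

lags = parse_consumer_lag(sample)
total_lag = sum(lags.values())
print(lags, total_lag)
```

<p>A total that keeps growing between two runs indicates that Logstash is consuming more slowly than the producers are writing.</p>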
<p><strong>C) Ingest lag in the Logstash processing:</strong> This lag is the time difference between the first and last Logstash filters. To calculate it, add a filter at the end of the pipeline that records the <code>logstash.end</code> timestamp in the same way the <code>logstash.start</code> field was added earlier. The primary factors contributing to this lag are the processing efficiency of the filters, which depends mainly on the complexity and optimization of the transformations they perform; calls to external services for data loading, which may involve the network; a limited number of <a href="https://www.elastic.co/guide/en/logstash/current/logstash-settings-file.html">pipeline workers and a small batch size</a>; and the amount of resources available to Logstash, particularly when running in virtual environments with resource contention.</p>
<p><strong>D) Ingest lag between Logstash and Elasticsearch:</strong> This lag is the time difference between the last applied Logstash filter in the pipeline and the timestamp when the event is ingested in Elasticsearch. The ECS field <a href="https://www.elastic.co/guide/en/ecs/current/ecs-event.html#field-event-ingested"><code>event.ingested</code></a> is automatically added by the Elastic integrations to record this value. For custom sources, the field should be added via an ingest pipeline:</p>
<pre><code>{
    &quot;processors&quot;: [
      {
        &quot;set&quot;: {
          &quot;field&quot;: &quot;event.ingested&quot;,
          &quot;value&quot;: &quot;{{_ingest.timestamp}}&quot;
        }
      }
…
</code></pre>
<p>If the data is undergoing heavy processing in Elasticsearch before indexing, it also pays to analyze the performance of each ingest processor in the pipeline to pinpoint and optimize the heaviest ones. The <a href="https://github.com/elastic/integrations/pull/4597">ingest pipelines monitoring dashboard</a> can help streamline this process.</p>
<p>The primary factors contributing to this phase’s lag are usually the Logstash output configuration, such as a small number of pipeline workers or a small batch size, slow indexing actions (like upserts), network latency, and how fast the Elasticsearch cluster can run the ingest pipelines and index the data. You can find more techniques about this last point <a href="https://www.elastic.co/docs/deploy-manage/production-guidance/optimize-performance/indexing-speed">here</a>.</p>
<p>Visualizing these stages in Kibana helps identify the most throttled areas and analyze the impact of various parameter adjustments across the entire data pipeline during the tuning process.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image4.png" alt="" /></p>
<h2>Isolate and fix the bottleneck</h2>
<p>Identifying the source of the bottleneck can be challenging without a systematic approach to isolating the behavior of each component and stage of the pipeline. To make the investigation approach more consistent, it is important to keep the source data consistent as well. One approach can be to use a dedicated topic with a replicated production workload, and repeat the test using different consumer groups.</p>
<p>Below is a set of benchmarks that can be run while monitoring the event lag and the pipeline throughput. The best results achieved in each tuning exercise can be used as the baseline for the next one.</p>
<h2>First benchmark: Kafka input, no filters, null output</h2>
<p>This benchmark is aimed at assessing the throughput of the Kafka input in isolation, excluding the downstream impacts of the Logstash filters and outputs. Use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-outputs-sink.html">sink</a> plugin in the output section to discard the events without incurring IO overhead and get a theoretical maximum reading speed.</p>
<p>This test is best performed both with and without a <a href="https://www.elastic.co/guide/en/logstash/current/persistent-queues.html#persistent-queues-architecture">persistent queue</a> to isolate the additional overhead at this stage.</p>
<p>It is helpful to use a unique consumer <code>group_id</code> for this test instead of the default <code>logstash</code>. Otherwise, this null-output pipeline might consume and drop events that should be processed by other pipelines.</p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
}
output {
  sink { }
}
</code></pre>
<p>If the throughput from this test closely matches that of the original pipeline, the bottleneck is most likely upstream: consuming the events from Kafka is what limits the pipeline.</p>
<p>Note that the maximum throughput is significantly impacted by the Kafka cluster's ability to handle consumer requests and network latency. The maximum throughput is also bound by the rate of events that is flowing into the Kafka topic once the consumer group has caught up with the topic.</p>
<p>A few things might be considered in this exercise: </p>
<ul>
<li>
<p><strong>Match the consumer count to the partition count:</strong> Ideally, the total number of <a href="https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html#plugins-inputs-kafka-consumer_threads">consumer threads</a> across all the pipelines that share the same consumer <code>group_id</code> should equal the number of topic partitions. Each Kafka topic partition can be assigned to at most one consumer within a consumer group at a time, so if you have more consumer threads than topic partitions, some of those threads will not be assigned a partition. Partition replicas do not count, as consumer threads consume messages from the leader partitions, not directly from replicas. Exceeding the 1:1 ratio may also introduce unnecessary computational overhead in Logstash without any gain in read throughput. Incrementally increasing the partition count of the topic can potentially improve throughput. Kafka 4.0 introduces early access to <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka">KIP-932</a>, which removes this 1:1 mapping requirement through share groups that add queuing semantics to the consumption model. Share groups are not supported in Logstash yet.</p>
</li>
<li>
<p><strong>Tune the input parameters for maximum throughput:</strong> Increasing <code>max.poll.records</code>, <code>fetch.max.bytes</code>, and <code>receive.buffer.bytes</code> can enhance performance. The TCP read buffer size is rarely the limiting factor, but when it is, the impact can be significant. The receive buffer size is bound by the <code>net.core.rmem_max</code> value.</p>
</li>
<li>
<p><strong>Use fast disks with enough space if using persistent queues:</strong> The queue sits between the input and filter stages in the same process. The I/O performance of the storage directly impacts the input throughput. When the queue is full, Logstash puts back pressure on the inputs to stall the data flow.</p>
</li>
</ul>
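<p>The partition-to-thread arithmetic from the first bullet can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that every Logstash instance in the consumer group runs the same number of consumer threads.</p>

```python
def idle_consumer_threads(partitions, instances, threads_per_instance):
    """Threads in one consumer group that cannot be assigned a partition."""
    total_threads = instances * threads_per_instance
    return max(0, total_threads - partitions)


def threads_per_instance_for(partitions, instances):
    """Smallest per-instance thread count that covers every partition."""
    return -(-partitions // instances)  # ceiling division


# Four Logstash instances with four threads each against a 12-partition topic.
print(idle_consumer_threads(partitions=12, instances=4, threads_per_instance=4))
# Thread count per instance that gives a 1:1 mapping for the same topic.
print(threads_per_instance_for(partitions=12, instances=4))
```

<p>In this example, four of the sixteen threads would sit idle, while three threads per instance give an exact 1:1 mapping.</p>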
<h2>Second benchmark: Kafka input, filters, no outputs</h2>
<p>This benchmark helps measure the impact of the filters on the input throughput using the best input configuration achieved in the first exercise. It quantifies the throttling effect that event processing alone has on the input stream. Note that <a href="https://www.elastic.co/guide/en/logstash/current/lookup-enrichment.html">some filter plugins</a> are also I/O-bound, like the plugins that use the network to enrich the events.</p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
...
}
output {
}
</code></pre>
<p>To increase the number of events that the filters process simultaneously, try increasing the number of pipeline workers and the pipeline batch size, particularly if the pipeline <code>worker_utilization</code> <a href="https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#plugin-flow-rates">flow metric</a> is near 100 and Logstash is not using all available CPU. Increasing the number of workers <a href="https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html">past the number of available processors</a> can also improve performance, as some filter plugins may spend considerable time in an I/O wait state, such as during external lookups. This also makes more efficient use of the Logstash host resources.</p>
<p>Optimizing the pipeline filters is the most effective way to address this bottleneck. It can significantly reduce latency and boost throughput, regardless of the pipeline's input configuration or available resources. The per-plugin <code>worker_utilization</code> and <code>worker_millis_per_event</code> <a href="https://www.elastic.co/guide/en/logstash/current/node-stats-api.html#plugin-flow-rates">flow metrics</a> are useful for finding which plugins consume the most resources, and optimization efforts should focus on those plugins first. Some general best practices that usually help are using <a href="https://www.elastic.co/blog/do-you-grok-grok">anchors</a> in Grok patterns, switching to faster plugins like <a href="https://www.elastic.co/blog/logstash-dude-wheres-my-chainsaw-i-need-to-dissect-my-logs">dissect</a> whenever possible, optimizing Ruby filter code, eliminating unnecessary parsing, and improving network-based enrichments.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image3.png" alt="" />
<em>Source: <a href="https://www.elastic.co/blog/do-you-grok-grok">do you grok</a></em></p>
<p>In some cases, optimizing the pipeline may require a complete redesign of the ingestion workflow or the pipeline itself!</p>
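<p>One way to act on the per-plugin flow metrics mentioned above is to rank the filters by their cost per event. The Python sketch below does this for a response from the node stats API. The sample payload is hand-written and heavily trimmed, the plugin IDs are made up, and the nesting (<code>pipelines</code>, then <code>plugins</code>, then <code>filters</code>) and the <code>current</code> window name are assumptions that should be checked against the output of your Logstash version.</p>

```python
def slowest_filters(node_stats, pipeline_id, window="current"):
    """Return (plugin name, plugin id, ms per event) tuples, costliest first."""
    filters = node_stats["pipelines"][pipeline_id]["plugins"]["filters"]
    ranked = []
    for plugin in filters:
        flow = plugin.get("flow", {})
        cost = flow.get("worker_millis_per_event", {}).get(window)
        if cost is not None:  # skip plugins that report no flow metric yet
            ranked.append((plugin["name"], plugin["id"], cost))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


# Hand-written, trimmed sample; real responses carry many more fields.
sample_stats = {
    "pipelines": {
        "main": {
            "plugins": {
                "filters": [
                    {"id": "parse-message", "name": "grok",
                     "flow": {"worker_millis_per_event": {"current": 1.9}}},
                    {"id": "add-kafka-fields", "name": "mutate",
                     "flow": {"worker_millis_per_event": {"current": 0.02}}},
                    {"id": "geo-lookup", "name": "geoip",
                     "flow": {"worker_millis_per_event": {"current": 0.4}}},
                ]
            }
        }
    }
}

ranking = slowest_filters(sample_stats, "main")
for name, plugin_id, cost in ranking:
    print(f"{name:8} {plugin_id:18} {cost} ms/event")
```

<p>In this made-up sample the Grok filter dominates, so it would be the first candidate for anchoring or for replacement with dissect.</p>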
<h2>Third benchmark: Kafka input, no filters, Elasticsearch output</h2>
<p>This benchmark helps quantify the throttling effect of the Elasticsearch output on the input throughput. The test can be divided into two phases: the first phase uses raw logs to isolate the impact of Elasticsearch indexing, while the second phase assesses the impact of ingest pipelines.</p>
<p><em>In case a pipeline is using multiple outputs, note that,</em> <a href="https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html#output-isolator-pattern"><em>by default</em></a><em>, a pipeline is blocked if any single output is blocked. This behavior is important in guaranteeing at-least-once delivery of data, but can cause the outputs to perform at the rate of the most clogged one.</em></p>
<pre><code>input {
 kafka {
   ...
 }
}
filter {
}
output {
 elasticsearch {
   ...
 }
}
</code></pre>
<p>To increase throughput, consider progressively increasing the number of pipeline workers and the pipeline batch size. The earlier guidance about the <code>worker_utilization</code> flow metric applies here too, although CPU availability plays a smaller role since this output is mostly I/O-bound. Also keep an eye on the Elasticsearch output's rejection rates (for example, response code 429 with <code>es_rejected_execution_exception</code>, which indicates explicit back-pressure) as a signal that the Elasticsearch cluster is busy processing other batches.</p>
<p>The Logstash output tries to send batches of events to the Elasticsearch Bulk API in a single request. However, if a batch exceeds 20 MB, the plugin splits it into multiple bulk requests. </p>
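<p>The effect of this size limit on a large batch can be illustrated with a short Python sketch. It is not the plugin's actual code: it measures each event as its JSON-serialized size, ignores the bulk action metadata lines, and uses a hypothetical function name, <code>split_batch</code>.</p>

```python
import json

MAX_BULK_BYTES = 20 * 1024 * 1024  # the 20 MB threshold described above


def split_batch(events, max_bytes=MAX_BULK_BYTES):
    """Yield lists of events whose serialized size stays within max_bytes.

    An event that is larger than max_bytes on its own is yielded alone.
    """
    chunk, chunk_bytes = [], 0
    for event in events:
        size = len(json.dumps(event).encode("utf-8")) + 1  # +1 for the newline
        if chunk and chunk_bytes + size > max_bytes:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(event)
        chunk_bytes += size
    if chunk:
        yield chunk


# Five events of 56 bytes each with a tiny 120-byte limit, for illustration.
events = [{"message": "x" * 40} for _ in range(5)]
chunk_sizes = [len(chunk) for chunk in split_batch(events, max_bytes=120)]
print(chunk_sizes)
```

<p>With large events, a batch size that looks reasonable as an event count can therefore turn into several bulk requests per batch.</p>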
<p>If the Elasticsearch cluster is behind a proxy or API gateway, it's important to adjust the proxy limits to allow Logstash requests with large payloads to pass through to the Elasticsearch cluster. By default, most proxy servers have a much smaller maximum size for HTTP request payloads, which should be tuned in this case to accommodate larger requests. To identify potential issues, look for error code 413 in your proxy logs, as this indicates that the size of the Logstash request has exceeded the maximum payload size the proxy is configured to handle.</p>
<p>On the Elasticsearch cluster, tune the efficiency of your ingest pipelines following the same general best practices discussed above for the Logstash pipelines. Also, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html">tune for indexing speed</a> by using faster hardware, fewer index refreshes, and auto-generated IDs, and consider increasing the number of primary shards to enhance indexing parallelism if you have multiple nodes. Beware that excessively increasing this number can negatively impact search performance.</p>
<p>Finally, keep in mind that the Elasticsearch output plugin is mostly I/O-bound, which means that network latency and bandwidth significantly affect the rate at which data is transferred, and hence your output throughput.</p>
<h2>Reassemble your pipeline</h2>
<p>After tuning the pipeline in each of the previous phases separately, put all the parts together again to assess the real throughput and latency of the reassembled pipeline. At this last step, you should have reached the best performance from your Logstash host as well, and you can progressively add more instances to reach the latency and throughput you are aiming for with a specific topic or data source.</p>
<h2>Example</h2>
<p>Below is an example of the configuration required on Logstash and Elasticsearch to implement the architecture above.</p>
<p>Logstash pipeline:</p>
<pre><code>input {
 kafka {
   bootstrap_servers =&gt; &quot;&lt;server&gt;:&lt;port&gt;&quot;
   topics =&gt; [&quot;&lt;topic-id&gt;&quot;]
   group_id =&gt; &quot;&lt;consumer-group-id&gt;&quot;
   decorate_events =&gt; &quot;extended&quot;
   auto_offset_reset =&gt; &quot;earliest&quot;
   codec =&gt; json {
   }
 }
}


filter {
 ruby {
   code =&gt; &quot;event.set('[logstash][start]', Time.now());&quot;
 }


 mutate {
   add_field =&gt; {
     &quot;[kafka][timestamp]&quot; =&gt; &quot;%{[@metadata][kafka][timestamp]}&quot;
     &quot;[kafka][offset]&quot; =&gt; &quot;%{[@metadata][kafka][offset]}&quot;
     &quot;[kafka][consumer_group]&quot; =&gt; &quot;%{[@metadata][kafka][consumer_group]}&quot;
     &quot;[kafka][topic]&quot; =&gt; &quot;%{[@metadata][kafka][topic]}&quot;
   }
 }


 date {
   match =&gt; [&quot;[kafka][timestamp]&quot;, &quot;UNIX&quot;, &quot;UNIX_MS&quot;]
   target =&gt; &quot;[kafka][timestamp]&quot;
 }
 ...
 ruby {
   code =&gt; &quot;event.set('[logstash][end]', Time.now());&quot;
 }
}


output {
 elasticsearch {
   hosts =&gt; &quot;hosts&quot;
   api_key =&gt; &quot;api_key&quot;
   data_stream =&gt; true
   ssl =&gt; true
 }
}
</code></pre>
<p>Create an ingest pipeline for lag calculation. Note that when using Elastic integrations, the ECS fields <code>*.end</code>, <code>*.start</code>, and <code>*.timestamp</code> are automatically mapped as dates.</p>
<pre><code>PUT _ingest/pipeline/calculate_ingest_lag
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.ingested&quot;,
        &quot;value&quot;: &quot;{{_ingest.timestamp}}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;if&quot;: &quot;ctx['@timestamp'] != null &amp;&amp; ctx?.kafka?.timestamp != null &amp;&amp; ctx?.logstash?.start != null &amp;&amp; ctx?.logstash?.end != null &amp;&amp; ctx?.event?.ingested != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
          ctx.lag_in_millis = [:];
          ctx.lag_in_millis.src_kfk = Duration.between(ZonedDateTime.parse(ctx['@timestamp']), ZonedDateTime.parse(ctx['kafka']['timestamp'])).toMillis();
          ctx.lag_in_millis.kfk_ls = Duration.between(ZonedDateTime.parse(ctx['kafka']['timestamp']), ZonedDateTime.parse(ctx['logstash']['start'])).toMillis();
          ctx.lag_in_millis.within_ls = Duration.between(ZonedDateTime.parse(ctx['logstash']['start']), ZonedDateTime.parse(ctx['logstash']['end'])).toMillis();
          ctx.lag_in_millis.ls_es = Duration.between(ZonedDateTime.parse(ctx['logstash']['end']), ZonedDateTime.parse(ctx['event']['ingested'])).toMillis();
          ctx.lag_in_millis.end_end = Duration.between(ZonedDateTime.parse(ctx['@timestamp']), ZonedDateTime.parse(ctx['event']['ingested'])).toMillis();
        &quot;&quot;&quot;
      }
    }
  ]
}
</code></pre>
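<p>To sanity-check the numbers the ingest pipeline produces, the same arithmetic can be reproduced offline. The Python sketch below mirrors the stage names of the script above for a single document. The sample timestamps are invented, and the sketch assumes ISO 8601 strings with millisecond precision and a trailing <code>Z</code>.</p>

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp, mapping a trailing 'Z' to a UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def lag_in_millis(doc):
    """Compute the same per-stage lags as the ingest pipeline, in milliseconds."""
    stages = {
        "src_kfk": (doc["@timestamp"], doc["kafka"]["timestamp"]),
        "kfk_ls": (doc["kafka"]["timestamp"], doc["logstash"]["start"]),
        "within_ls": (doc["logstash"]["start"], doc["logstash"]["end"]),
        "ls_es": (doc["logstash"]["end"], doc["event"]["ingested"]),
        "end_end": (doc["@timestamp"], doc["event"]["ingested"]),
    }
    one_ms = timedelta(milliseconds=1)
    return {
        name: (parse_ts(end) - parse_ts(start)) // one_ms
        for name, (start, end) in stages.items()
    }


# Invented timestamps for one event travelling through the pipeline.
doc = {
    "@timestamp": "2026-03-11T10:00:00.000Z",
    "kafka": {"timestamp": "2026-03-11T10:00:00.250Z"},
    "logstash": {"start": "2026-03-11T10:00:01.000Z",
                 "end": "2026-03-11T10:00:01.040Z"},
    "event": {"ingested": "2026-03-11T10:00:01.500Z"},
}

lags = lag_in_millis(doc)
print(lags)
```

<p>The four stage lags should add up to <code>end_end</code>. If they do not, one of the timestamps is probably in a different time zone or taken from a different clock, which is the kind of time configuration mismatch mentioned earlier.</p>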
<p>Use the pipeline to add the lag calculation to your Elastic integrations:</p>
<pre><code>PUT _ingest/pipeline/logs-system.integration@custom
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;calculate_ingest_lag&quot;,
        &quot;ignore_missing_pipeline&quot;: true,
        &quot;description&quot;: &quot;add ingest lag calculation to elastic_agent integration&quot;
      }
    }
  ]
}
</code></pre>
<h2>Kibana Dashboard and Alerts</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image6.png" alt="" />
<img src="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/image2.png" alt="" /></p>
<p>Using the metrics mentioned above along with the <a href="https://www.elastic.co/guide/en/observability/current/inspect-log-anomalies.html">Log Rate ML job</a>, you can set up <a href="https://www.elastic.co/guide/en/kibana/current/rule-types.html#observability-rules">Kibana alerts</a> that trigger on anomalous changes in throughput or delays, or simply when delays exceed defined thresholds.</p>
<h2>Time to try it out</h2>
<p>Start your <a href="https://cloud.elastic.co/registration?elektra=whats-new-elastic-7-14-blog">free 14-day trial of Elastic Cloud</a> to experience the latest version of <a href="https://www.elastic.co/security">Elastic</a>. Also, make sure to take advantage of the Elastic threat detection <a href="https://www.elastic.co/training/elastic-security-quick-start">training</a> to set yourself up for success.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kafka-logstash-elasticsearch-performance-issues/cover-resized.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Kibana: How to create impactful visualisations with magic formulas ? (part 1)]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kibana-impactful-visualizations-with-magic-formulas-part1</link>
            <guid isPermaLink="false">kibana-impactful-visualizations-with-magic-formulas-part1</guid>
            <pubDate>Mon, 09 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[We will see how magic math formulas in the Kibana Lens editor can help to highlight high values.]]></description>
            <content:encoded><![CDATA[<h2>Kibana: How to create impactful visualizations with magic formulas? (part 1)</h2>
<h3>Introduction</h3>
<p>In the previous blog post, <a href="https://www.elastic.co/blog/designing-intuitive-kibana-dashboards-as-a-non-designer">Designing Intuitive Kibana Dashboards as a non-designer</a>, we highlighted the importance of creating intuitive dashboards. It demonstrated how simple changes (grouping themes, changing chart types, and more) can make a difference in understanding your data. When delivering courses like <a href="https://www.elastic.co/training/data-analysis-with-kibana">Data Analysis with Kibana</a> or <a href="https://www.elastic.co/training/elastic-observability-engineer">Elastic Observability Engineer</a>, we emphasize this blog post and how these changes help bring essential information to the surface. I like a complementary approach to reach this goal: using two colors to separate the highest data values from the common ones.</p>
<p>To illustrate this idea, we will use the <em>Sample flight data</em> dataset. Now, let’s compare two visualizations ranking the top 10 destination countries per total number of flights. Which visualization has a higher impact?</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-teaser-intro.png" alt="Flights: Top 10 destinations" /></p>
<p>If you chose the second one, you may be wondering how this was done with the Kibana Lens editor. While preparing for the certification last year, I found a way to achieve this result. The secret is using two different layers and some magic formulas. This post will explain how math in Lens formulas helps create two data-color visualizations.</p>
<p>We will start with the first example that emphasizes only the highest value of the dataset we are focusing on. The second example describes how to highlight other high values (as shown in the illustration above).</p>
<p><em>[Note: the tips explained in this blog post can be applied from v 7.15]</em></p>
<h2>Only the highest value<a id="only-the-highest-value"></a></h2>
<p>To understand how math helps to separate high values from common ones, let’s start with this first example: emphasizing only the highest value.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-teaser.png" alt="1.1 flights: " /></p>
<p>We start with a horizontal bar chart:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-kibana-bar-horizontal-setup.png" alt="1.1 flights: Lens bar horizontal chart" /></p>
<p>We need to identify the highest value of the scope we are currently examining. We will use one of the overall_* functions: <strong>overall_max()</strong>, a pipeline function (equivalent to a pipeline aggregation in Query DSL).</p>
<p>In our example, we group the flights by destination country. This means we count the number of flights for each DestCountry (= 1 bucket). <strong>overall_max()</strong> returns the highest value found among all the buckets.</p>
<p>The math trick here is to divide the number of flights per bucket by the maximum value found among all buckets. Only one bucket will return 1: the bucket matching the max value found by overall_max(). All the other buckets will return a value &lt; 1 and &gt;0. We use <strong>floor()</strong> to ensure any 0.xxx values are rounded to 0. </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-explaination-floor.png" alt="1.1 flights: explaining floor()" /></p>
<p>Now, we can multiply it by count() and we have our formula for the first layer!</p>
<p><strong><em>Layer 1</em></strong>: <code>count()*floor(count()/overall_max(count()))</code></p>
<p>From here, in the Lens editor, we duplicate the layer and adjust the formula of the second layer, which contains the rest of the data. We prefix the formula with another count() followed by the minus operator. This is the other trick. In this layer, we just need to ensure the highest value is not represented. That happens only for the bucket where count() equals overall_max(count()): the division returns 1, so the layer evaluates to count() - count() = 0.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-explaination-layer1-and-layer2.png" alt="1.1 flights: layer 1 + layer 2" /></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*floor(count()/overall_max(count()))</code></p>
<p>To achieve a nice merge of these two layers, we need to make the following adjustments in both:</p>
<ul>
<li>
<p>Select <strong>bar horizontal stacked</strong></p>
</li>
<li>
<p>Vertical axis: change “Rank by” to Custom and ensure the Rank function is “Count”</p>
</li>
</ul>
<p>Here is the final setup of the two layers:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-kibana-final-2layers-setup.png" alt="1.1 flights: 2layers setup" /></p>
<p><strong><em>Layer 1</em></strong>: <code>count()*floor(count()/overall_max(count()))</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*floor(count()/overall_max(count()))</code></p>
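<p>If it helps to see the arithmetic outside Lens, the following Python sketch applies the same two formulas to a handful of buckets. The country counts are made-up values, not taken from the sample dataset, and <code>math.floor</code> stands in for the Lens floor() function.</p>
<pre><code>import math

# Hypothetical flight counts per destination country (illustrative values only).
counts = {'IT': 2600, 'US': 2000, 'CN': 1400, 'CA': 900, 'JP': 700, 'RU': 100}

# Stand-in for overall_max(count()): the highest count among all buckets.
overall_max = max(counts.values())

# Layer 1: count() * floor(count() / overall_max(count()))
layer1 = {c: n * math.floor(n / overall_max) for c, n in counts.items()}

# Layer 2: count() - count() * floor(count() / overall_max(count()))
layer2 = {c: n - layer1[c] for c, n in counts.items()}

print(layer1)
print(layer2)
</code></pre>
<p>Only the bucket holding the maximum keeps its count in Layer 1. Every other bucket lands in Layer 2, so stacking the two layers always restores the original count.</p>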
<p>This visualization also works well for time series data where you need to quickly highlight which time period (12h in the example below) had the highest number of flights:<br />
<img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-1.1-timeserie-example.png" alt="1.1 flights: timeseries example" /></p>
<h2>Above the surface<a id="above-the-surface"></a></h2>
<p>Building on what we have done earlier, we can extend the approach to get other high values above the surface. Let’s see which formula we used to create the visualization in the introduction:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-teaser-intro-s1.png" alt="2.1 Flights: Top 10 destinations" /></p>
<p>For this visualization, we used a property of the <strong>round()</strong> function: the ratio count()/overall_max(count()) rounds to 1 only for values at or above 50% of the highest value, and to 0 otherwise.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.1-explaination-round.png" alt="2.1 flights: round() &gt; 50% of max explanation" /></p>
<p>Let's duplicate our first visualization and swap out the floor() function with round().</p>
<p><strong><em>Layer 1</em></strong>: <code>count()*round(count()/overall_max(count()))</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*round(count()/overall_max(count()))</code></p>
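<p>The same idea can be checked with a short Python sketch. The counts are made-up values, and <code>round_half_up</code> is a hypothetical helper: it assumes Lens round() rounds 0.5 upward, whereas Python's built-in <code>round()</code> rounds halves to the nearest even number.</p>
<pre><code>import math

def round_half_up(x):
    # Assumption: mimic a round() that rounds 0.5 upward.
    # Python's built-in round() would turn 0.5 into 0 instead.
    return math.floor(x + 0.5)

# Hypothetical flight counts per destination country (illustrative values only).
counts = {'IT': 2600, 'US': 2000, 'CN': 1400, 'CA': 900, 'JP': 700, 'RU': 100}
overall_max = max(counts.values())

# Layer 1: count() * round(count() / overall_max(count()))
layer1 = {c: n * round_half_up(n / overall_max) for c, n in counts.items()}

# Layer 2: count() - count() * round(count() / overall_max(count()))
layer2 = {c: n - layer1[c] for c, n in counts.items()}

print(layer1)
print(layer2)
</code></pre>
<p>With these numbers, the buckets whose count is at least half of the maximum (IT, US, and CN) move to Layer 1, while the others stay in Layer 2.</p>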
<p>It was an easy fix.<br />
What if we want to extend the first layer further by adding more high values?<br />
For instance, we would like all the values above the average.</p>
<p>To do this, we use <strong>overall_average()</strong> as a new reference value instead of overall_max() to separate the eligible values in Layer 1.</p>
<p>As we are comparing against the average value among all the buckets, the division might return values greater than 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.2-explaination-floor.png" alt="2.2 flights: round() explanation" /></p>
<p>Here, the <strong>clamp</strong>() function nicely solves this issue. </p>
<p>According to the formula reference, clamp() &quot;limits the value from a minimum to maximum&quot;. Combining clamp() and floor() ensures that there are only two possible output values: either the minimum value ( 0 ) or the maximum value ( 1 ) given as parameters.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-wbg-flights-2.2-explaination-clamp.png" alt="2.2 flights: clamp() explanation" /></p>
<p>Applied to our flights dataset, it highlights the country destinations that have more flights than the average:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/blog-1-dbg-excalidraw-flights-2.2-above-overall-average.png" alt="2.2 flights: above the overall average " /></p>
<p><strong><em>Layer 1</em></strong>: <code>count()*clamp(floor(count()/overall_average(count())),0,1)</code></p>
<p><strong><em>Layer 2</em></strong>: <code>count() - count()*clamp(floor(count()/overall_average(count())),0,1)</code></p>
<p>It also opens up options for using other dynamic references. For instance, we could place all the values greater than 60% of the highest above the surface ( &gt; <code>0.6*overall_max(count())</code>).
We can tune our formula as follows:</p>
<pre><code>
count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)
</code></pre>
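<p>As a rough check of why clamp() is needed, the Python sketch below evaluates both reference values, the overall average and 60% of the maximum, on made-up counts. The <code>clamp</code> and <code>split_layers</code> helpers are hypothetical stand-ins written for this illustration; they are not part of Lens.</p>
<pre><code>import math

def clamp(value, low, high):
    # Limit value to the range [low, high].
    return max(low, min(high, value))

def split_layers(counts, reference):
    # flag is 1 when count() reaches the reference value, otherwise 0.
    layer1, layer2 = {}, {}
    for key, n in counts.items():
        flag = clamp(math.floor(n / reference), 0, 1)
        layer1[key] = n * flag
        layer2[key] = n - n * flag
    return layer1, layer2

# Hypothetical flight counts per destination country (illustrative values only).
counts = {'IT': 2600, 'US': 2000, 'CN': 1400, 'CA': 900, 'JP': 700, 'RU': 100}
overall_average = sum(counts.values()) / len(counts)
overall_max = max(counts.values())

# Without clamp(), the largest bucket would be multiplied by 2 instead of 1.
unclamped_top = math.floor(counts['IT'] / overall_average)

above_average, below_average = split_layers(counts, overall_average)
above_60pct, below_60pct = split_layers(counts, 0.6 * overall_max)

print(unclamped_top)
print(above_average)
print(above_60pct)
</code></pre>
<p>With these numbers, CN is above the average but below 60% of the maximum, so it changes layers depending on the reference value you pick.</p>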
<h2>Conclusion<a id="conclusion"></a></h2>
<p>In the first part, we have seen the main tips allowing us to create a two-color histogram:</p>
<ul>
<li>
<p>Two layers: one for the highest value and one for the remaining values</p>
</li>
<li>
<p>Visualization type: bar horizontal/vertical <strong>stacked</strong></p>
</li>
<li>
<p>To separate the data, we use a formula that returns 1 only for the highest value and 0 otherwise</p>
</li>
</ul>
<p>Then in the second part, we have seen how we can extend this principle to embrace more high values above the surface. This approach can be summarized as follows:</p>
<ul>
<li>
<p>Start with layer 1 focusing on the high value: count()*&lt;formula returning 0 or 1&gt;</p>
</li>
<li>
<p>Duplicate the layer and adjust the formula:<br />
 ( count() - count()*&lt;formula returning 0 or 1&gt;)</p>
</li>
</ul>
<p>Finally, we provide 4 generic formulas that are ready to use to spice up your dashboards:</p>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1. Only the highest</strong></td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*floor(count()/overall_max(count()))</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*floor(count()/overall_max(count()))</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.1. Above the surface:</strong> high values (above 50% of the max value)</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*round(count()/overall_max(count()))</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*round(count()/overall_max(count()))</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.2. Above the surface:</strong> all values above the overall average</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*clamp(floor(count()/overall_average(count())),0,1)</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*clamp(floor(count()/overall_average(count())),0,1)</code></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th></th>
<th align="center"></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>2.3. Above the surface:</strong> all the values greater than 60% of the highest</td>
<td align="center"></td>
</tr>
<tr>
<td>Layer 1</td>
<td align="center"><code>count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)</code></td>
</tr>
<tr>
<td>Layer 2</td>
<td align="center"><code>count() - count()*clamp(floor(count()/(0.6*overall_max(count()) ) ),0,1)</code></td>
</tr>
</tbody>
</table>
<p>Try these examples out for yourself by signing up for a <a href="https://cloud.elastic.co/registration?elektra=10-common-questions-kibana-blog">free trial of Elastic Cloud</a> or <a href="https://www.elastic.co/downloads/">download</a> the self-managed version of the Elastic Stack for free. If you have additional questions about getting started, head on over to the <a href="https://discuss.elastic.co/c/elastic-stack/kibana/7">Kibana forum</a> or check out the <a href="https://www.elastic.co/guide/en/kibana/current/index.html">Kibana documentation guide</a>.<br />
In the next blog post, we will see how the new function <strong>ifelse</strong>() (introduced in version 8.6) will greatly simplify the creation of visualizations with more advanced formulas.</p>
<p><strong>References</strong>:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/designing-intuitive-kibana-dashboards-as-a-non-designer">Designing intuitive Kibana dashboards as a non-designer</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/guide/en/kibana/current/lens.html#lens-formulas">Kibana: Lens editor - use formula to perform math</a></p>
</li>
<li>
<p>Discovering the clamp() function <a href="https://discuss.elastic.co/t/if-condition-in-kibana-table-visualization/305751/5">in this discussion (Thanks Marco!)</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kibana-impactful-visualizations-with-magic-formulas-part1/kibana-magic-formulas-p1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Process Kubernetes logs with ease using Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-logs-elastic-streams-processing</link>
            <guid isPermaLink="false">kubernetes-logs-elastic-streams-processing</guid>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to process Kubernetes logs with Elastic Streams using conditional blocks, AI-generated Grok patterns, and selective drops to reduce noise and storage cost.]]></description>
            <content:encoded><![CDATA[<p>Streams is a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips this script by transforming raw logs into your most valuable asset to immediately identify not only the root cause, but also the why behind the root cause to enable instant resolution.</p>
<p>Learn more from our previous article, <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">Introducing Streams</a>.</p>
<p>Many SREs deploy on cloud native architectures. Kubernetes is essentially the baseline deployment architecture of choice. Yet Kubernetes logs are messy by default. A single (data)stream often mixes access logs, JSON payloads, health checks, and internal service chatter.</p>
<p>Elastic Streams gives you a faster path. You can isolate subsets of logs with conditionals, use AI to generate Grok patterns from real samples, and drop documents you do not need before they add storage and query cost.</p>
<h2>Why Kubernetes logs get messy fast</h2>
<p>The default Kubernetes container logs stream can contain data from many services at once. In one sample, you might see:</p>
<ul>
<li>HTTP access logs from application pods</li>
<li>Verbose worker or batch job status logs</li>
<li>Platform and container lifecycle events with different formats</li>
</ul>
<p>This is why &quot;one global parsing rule&quot; will fail. You need targeted processing logic per log shape or type of application.
Historically, doing this kind of custom processing has been error-prone and time-consuming.</p>
<h2>What Streams Processing changes</h2>
<p>Streams Processing (available in 9.2 and later) moves this workflow into a live, interactive experience:</p>
<ul>
<li>You build conditions and processors in the UI</li>
<li>You validate each change against sample documents before saving</li>
<li>You can use AI to generate extraction patterns from selected logs</li>
</ul>
<p>The result is a safer way to iterate on parsing logic without guessing.</p>
<h2>Walkthrough: parse custom application logs</h2>
<p>We'll start from your Kubernetes stream (logs-kubernetes.containers_logs-default) and create a conditional block that scopes processing to one service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/01-conditional-filter-litellm.png" alt="Conditional block filtering Kubernetes logs for litellm before parsing in Elastic Streams" /></p>
<p>Once the condition is saved, it will automatically filter the sample data to a subset of logs that match the condition. This is indicated by the blue highlight in the preview.</p>
<p>Inside that block, we'll add a Grok processor and click <strong>Generate pattern</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/02-generate-pattern-button.png" alt="Generate pattern button in Elastic Streams using AI to process Kubernetes logs" /></p>
<p>This agentic process uses an LLM to generate a Grok pattern for parsing the logs. By default, it uses the Elastic Inference Service, but you can configure it to use your own LLM.
Review the generated pattern and accept it once the sample set validates.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/03-accept-generated-grok.png" alt="Accepting AI-generated Grok pattern after matching selected Uvicorn logs" /></p>
<h2>Walkthrough: drop noisy postgres-loadgen documents</h2>
<p>Not all logs are important enough to keep around forever. For example, logs from a load generator are not useful for long-term analysis, so let's drop those.</p>
<p>To do this, we'll add a second conditional block for logs we intentionally do not want to index long-term.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/05-preview-selected-postgres-loadgen.png" alt="Selected tab preview of noisy postgres-loadgen documents before drop" /></p>
<p>Add a drop processor inside this block, then validate in the <strong>Dropped</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/07-preview-dropped-tab.png" alt="Dropped tab preview showing noisy Kubernetes logs excluded from indexing" /></p>
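<p>Conceptually, the two conditional blocks behave like the Python sketch below. This is only an illustration of the logic, not how Streams is implemented: the <code>kubernetes.container.name</code> field used in the conditions, the simplified access-log shape, and the extracted field names are all assumptions, and a regular expression stands in for the generated Grok pattern.</p>
<pre><code>import re

# Hypothetical, simplified access-log shape: client, dash, method, path, status.
ACCESS_LOG = re.compile(r'^(\S+) - (\S+) (\S+) (\d{3})$')
FIELDS = ('client_ip', 'http_method', 'url_path', 'status_code')

def process(doc):
    container = doc.get('kubernetes.container.name')
    # Second conditional block: drop load generator noise entirely.
    if container == 'postgres-loadgen':
        return None
    # First conditional block: only parse logs from the litellm service.
    if container == 'litellm':
        match = ACCESS_LOG.match(doc['message'])
        if match:
            return {**doc, **dict(zip(FIELDS, match.groups()))}
    # Everything else passes through unchanged.
    return doc

docs = [
    {'kubernetes.container.name': 'litellm', 'message': '10.0.0.7 - GET /health 200'},
    {'kubernetes.container.name': 'postgres-loadgen', 'message': 'inserted 500 rows'},
    {'kubernetes.container.name': 'nginx', 'message': 'worker process started'},
]

processed = [process(d) for d in docs]
kept = [d for d in processed if d is not None]
print(kept)
</code></pre>
<p>Scoping the parser to one service keeps a pattern written for one log shape from being applied to unrelated logs in the same stream.</p>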
<h2>Save safely with live simulation</h2>
<p>One of the most useful parts of Streams is the preview-first workflow. You can inspect matched, parsed, skipped, failed, and dropped samples before making the change live.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/08-save-changes.png" alt="Save changes button after validating processing logic on live samples" /></p>
<h2>YAML mode and the equivalent API request</h2>
<p>The interactive builder works well for most edits, but advanced users can switch to YAML mode for direct control.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/11-yaml-mode.png" alt="Switching from interactive builder to YAML mode in Streams processing" /></p>
<p>You can also open <strong>Equivalent API Request</strong> to copy the payload for automation and Infrastructure as Code workflows.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/12-equivalent-api-request.png" alt="Equivalent API request panel for automating Streams processing" /></p>
<h2>A note on backwards compatibility</h2>
<p>Streams Processing builds on Elasticsearch ingest pipelines, so it works with the same ingestion model teams already use.</p>
<p>When you save processing changes, Streams appends logic through the stream processing pipeline model (for example via <code>@custom</code> conventions used by data streams). That means you can adopt conditionals, parsing, and selective dropping incrementally, without changing your Kubernetes log shippers.</p>
<h2>What's next?</h2>
<p>Streams Processing is continually getting new processing capabilities. Check out the <a href="https://www.elastic.co/docs/solutions/observability/streams/streams">Streams documentation</a> for the latest updates.</p>
<p>Over the coming months more of this will be automated and moved to the background, reducing the manual effort required to process logs.</p>
<p>Another milestone we're working towards is to offer this processing at read time, rather than write time. Using ES|QL, this will enable you to iterate on your parsing logic without having to worry about committing changes that are harder to revert.</p>
<p>Also try this out by getting a free trial on <a href="https://cloud.elastic.co/">Elastic Serverless</a>.</p>
<p>Happy log analytics!!!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/cover.svg" length="0" type="image/svg"/>
        </item>
        <item>
            <title><![CDATA[Serverless log analytics powered by Elasticsearch, in a new low priced tier]]></title>
            <link>https://www.elastic.co/observability-labs/blog/log-analytics-elastic-serverless-logs-essentials</link>
            <guid isPermaLink="false">log-analytics-elastic-serverless-logs-essentials</guid>
            <pubDate>Thu, 07 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability Logs Essentials delivers cost-effective, hassle-free log analytics on Elastic Cloud Serverless. SREs can ingest, search, enrich, analyze, store, and act on logs without the operational overhead of managing the deployment.]]></description>
            <content:encoded><![CDATA[<p>We're thrilled to introduce Elastic Observability Logs Essentials (Logs Essentials), a new tier in Elastic Cloud Serverless (SaaS). Built on the same robust stateless architecture as Elastic Observability Complete, it’s designed for Site Reliability Engineers (SREs) and developers seeking powerful, efficient, and economical log analytics, without the overhead of managing the Elastic Stack. As the leader in log management, Elasticsearch powers this new tier with unmatched search and analytics. </p>
<p>Logs Essentials is ideal for teams that want Elastic’s speed and scale without paying for premium features or managing the Elastic Stack. With Elastic Cloud Serverless, there’s no infrastructure to manage, and pricing is simple and predictable, making it easy to get started, stay supported, and focus on solving problems faster.</p>
<h2>Unmatched value for log analytics</h2>
<p>Logs Essentials empowers SREs and developers with analytics capabilities designed to help them quickly pinpoint the root cause of issues. </p>
<ul>
<li>
<p>Accelerate root cause analysis with fast, precise log search using filters, pattern matching, and event identification in seconds.</p>
</li>
<li>
<p>Gain deep contextual insights through ES|QL, Elastic’s powerful piped query language that supports structured exploration and joins across indices.</p>
</li>
<li>
<p>Detect issues proactively by setting alerts for error spikes or unusual log volumes, enabling timely incident response.</p>
</li>
<li>
<p>Visualize and monitor operational health with rich dashboards built in Kibana, giving teams a clear and actionable view of system behavior.</p>
</li>
</ul>
<p>Once on Logs Essentials, if you need SLOs, AI/ML, AI Assistant, or other advanced features to analyze logs, you should upgrade to <a href="https://www.elastic.co/pricing/serverless-observability">Observability Complete</a>. Additionally, if you are also interested in expanding to traces and metrics, you should upgrade to Observability Complete.</p>
<h2>SaaS making it simple</h2>
<p>SREs don’t have to worry about managing the powerful Elastic Stack with Logs Essentials. <a href="https://www.elastic.co/blog/journey-to-build-elastic-cloud-serverless">Elastic Cloud Serverless</a> automatically scales and adjusts to demand without impacting performance, all while keeping costs low. There is no operational overhead of managing a deployment and no need to be an Elastic Stack expert. SREs get the following benefits:</p>
<p><strong>No infrastructure to manage or scale:</strong> Elastic Cloud Serverless transitions from traditional stateful deployments to a fully stateless, autoscaling architecture, offloading storage to cloud-native object stores and orchestrating compute through Kubernetes. SRE teams can now focus solely on logs and insights, not capacity planning or cluster sizing.</p>
<p><strong>High reliability, resilience, and automation built-in:</strong> Elastic Cloud Serverless features multi-region deployments, automated control-plane and data-plane upgrades, automatic configuration updates, canary deployments, and capacity pool management to ensure always-on observability.</p>
<p>These capabilities deliver what SREs need: a hassle-free, scale-as-you-go, high-availability logging solution that empowers SREs to focus entirely on operational insights, not infrastructure.</p>
<h2>Affordable log analytics</h2>
<p>Logs Essentials offers a cost-effective and predictable path to log analytics. Elastic Cloud Serverless employs advanced autoscaling controllers that adjust compute and storage dynamically. This enables a flexible pricing model that charges based on real usage (ingest and retention), so SREs can “sign up and use” without upfront provisioning or surprise costs.</p>
<p>Instead of paying for idle capacity or managing infrastructure costs, users are billed based on ingest and retention, eliminating the guesswork and overprovisioning common in traditional observability solutions. SREs can simply sign up and start analyzing logs. No infrastructure to manage, no surprise costs, just transparent, cost-effective pricing for what they use.</p>
<h2>Logs Essentials in action</h2>
<p>Let’s walk through how a Site Reliability Engineer (SRE) would use it in a real-world scenario. Customers are unable to complete transactions on an ecommerce site and the root cause isn’t clear. The issue could be in the front end, the back end, the database, or even the load balancer. Fortunately, logs are being collected from multiple components including NGINX, MySQL, and the application itself. With Elastic Observability Logs Essentials, an SRE can quickly dive into these logs to investigate the issue, starting with high-level symptoms and drilling down across services using powerful search, correlation via ES|QL, and visualization tools like dashboards.</p>
<p>The investigation continues as the SRE walks through several steps using ES|QL, search, and dashboards.</p>
<ul>
<li>There is an alert indicating a logs spike, which is triggered by a significant number of MySQL errors indicating that a database table “orders” is full. We also use ES|QL to understand how many errors have been seen in the last three hours. </li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-alerts.jpg" alt="Alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-sql-error.jpg" alt="MySQL error" /></p>
<ul>
<li>Next, the SRE tries to understand the impact on customers and potential revenue by looking at how many HTTP errors are occurring and which region sees them most. With a significant number of status codes &gt;= 400 and the US as the main region affected, this is revenue impacting.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-nginx-400.jpg" alt="NGINX 400 Issues" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-geo.jpg" alt="GEO Analysis" /></p>
<ul>
<li>Next, the SRE looks at whether infrastructure is being impacted by finding the related Kubernetes cluster and pod. With this the SRE can further investigate whether the MySQL pod or the Kubernetes node is having CPU or memory utilization issues.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-k8s.jpg" alt="K8S Cluster Analysis" /></p>
<p>SREs can also create visualizations and dashboards easily through Observability Logs Essentials’ ES|QL, Discover, alerting, and dashboard capabilities.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials-dashboard.jpg" alt="Dashboard" /></p>
<h2>Get started with Observability Logs Essentials</h2>
<p>By combining the trusted capabilities of Elasticsearch with the flexibility and scalability of the Elastic Cloud Serverless offering, Logs Essentials delivers a streamlined, cost-effective solution that helps teams resolve incidents faster and with greater clarity. Whether you're troubleshooting critical outages, monitoring service health, or building dashboards for proactive insight, Logs Essentials gives you the tools you need (search, ES|QL, alerting, and visualization) in a package that’s simple to adopt and scale.</p>
<p>In order to get started, first <a href="https://cloud.elastic.co/serverless-registration">register on Elastic Cloud</a> and start a trial.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/log-analytics-elastic-serverless-logs-essentials/logs-essentials.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Convert Logstash pipelines to OpenTelemetry Collector Pipelines]]></title>
            <link>https://www.elastic.co/observability-labs/blog/logstash-to-otel</link>
            <guid isPermaLink="false">logstash-to-otel</guid>
            <pubDate>Fri, 25 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This guide helps Logstash users transition to OpenTelemetry by demonstrating how to convert common Logstash pipelines into equivalent OpenTelemetry Collector configurations. We will focus on the log signal.]]></description>
            <content:encoded><![CDATA[<h1>Convert Logstash pipelines to OpenTelemetry Collector Pipelines</h1>
<h2>Introduction</h2>
<p>Elastic's observability strategy is increasingly aligned with OpenTelemetry. With the recent launch of <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry</a>, we’re expanding our offering to make it easier to use OpenTelemetry. The Elastic Agent now offers an <a href="https://www.elastic.co/guide/en/fleet/current/otel-agent.html">&quot;otel&quot; mode</a>, enabling it to run a custom distribution of the OpenTelemetry Collector and improving your observability onboarding and experience with Elastic.</p>
<p>This post is designed to assist users familiar with Logstash transitioning to OpenTelemetry by demonstrating how to convert some standard Logstash pipelines into corresponding OpenTelemetry Collector configurations.</p>
<h2>What is OpenTelemetry Collector and why should I care?</h2>
<p><a href="https://opentelemetry.io/">OpenTelemetry</a> is an open-source framework that ensures vendor-agnostic data collection, providing a standardized approach for the collection, processing, and ingestion of observability data. Elastic is fully committed to this principle, aiming to make observability truly vendor-agnostic and eliminating the need for users to re-instrument their observability when switching platforms.</p>
<p>By embracing OpenTelemetry, you gain these benefits:</p>
<ul>
<li><strong>Unified Observability</strong>: By using the OpenTelemetry Collector, you can collect and manage logs, metrics, and traces from a single tool, providing holistic observability into your system's performance and behavior. This simplifies monitoring and debugging in complex, distributed environments like microservices.</li>
<li><strong>Flexibility and Scalability</strong>: Whether you're running a small service or a large distributed system, the OpenTelemetry Collector can be scaled to handle the amount of data generated, offering the flexibility to deploy as an agent (running alongside applications) or as a gateway (a centralized hub).</li>
<li><strong>Open Standards</strong>: Since OpenTelemetry is an open-source project under the Cloud Native Computing Foundation (CNCF), it ensures that you're working with widely accepted standards, contributing to the long-term sustainability and compatibility of your observability stack.</li>
<li><strong>Simplified Telemetry Pipelines</strong>: The ability to build pipelines using receivers, processors, and exporters simplifies telemetry management by centralizing data flows and minimizing the need for multiple agents.</li>
</ul>
<p>In the next sections, we will explain how OTEL Collector and Logstash pipelines are structured and how the components of one map to the components of the other.</p>
<h2>OTEL Collector Configuration</h2>
<p>An OpenTelemetry Collector <a href="https://opentelemetry.io/docs/collector/configuration/">Configuration</a> has different sections:</p>
<ul>
<li><strong>Receivers</strong>: Collect data from different sources.</li>
<li><strong>Processors</strong>: Transform the data collected by receivers.</li>
<li><strong>Exporters</strong>: Send data to one or more destinations, such as an observability backend or another collector.</li>
<li><strong>Connectors</strong>: Link two pipelines together</li>
<li><strong>Service</strong>: defines which components are active
<ul>
<li><strong>Pipelines</strong>:  Combine the defined receivers, processors, exporters, and connectors to process the data</li>
<li><strong>Extensions</strong> are optional components that expand the capabilities of the Collector to accomplish tasks not directly involved with processing telemetry data (e.g., health monitoring)</li>
<li><strong>Telemetry</strong> where you can set observability for the collector itself (e.g., logging and monitoring)</li>
</ul>
</li>
</ul>
<p>We can visualize it schematically as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/otel-config-schema.png" alt="otel-config-schema" /></p>
<p>For an in-depth introduction to the components, see the official documentation: <a href="https://opentelemetry.io/docs/collector/configuration/">Configuration | OpenTelemetry</a>.</p>
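<p>To make the relationship between the component sections and the <code>service</code> section concrete, here is a minimal Python sketch. It assumes a hypothetical, trimmed-down configuration already loaded into a dictionary, and the helper name <code>undefined_components</code> is made up for illustration. It only checks that every receiver, processor, and exporter referenced by a pipeline is also defined in its own section; connectors are ignored to keep the sketch short.</p>
<pre><code class="language-python"># Hypothetical, trimmed-down Collector configuration expressed as a Python dict.
config = {
    'receivers': {'filelog': {}},
    'processors': {'transform/grok': {}},
    'exporters': {'elasticsearch': {}},
    'service': {
        'pipelines': {
            'logs': {
                'receivers': ['filelog'],
                'processors': ['transform/grok'],
                'exporters': ['elasticsearch'],
            }
        }
    },
}


def undefined_components(config):
    '''List (pipeline, section, component) entries used in a pipeline but never defined.'''
    missing = []
    for name, pipeline in config['service']['pipelines'].items():
        for section in ('receivers', 'processors', 'exporters'):
            for component in pipeline.get(section, []):
                if component not in config.get(section, {}):
                    missing.append((name, section, component))
    return missing


print(undefined_components(config))
</code></pre>
<p>An empty list means that every component referenced by a pipeline is defined. The reverse does not hold: a component that is defined but not referenced by any pipeline is simply not active.</p>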
<h2>Logstash pipeline definition</h2>
<p>A <a href="https://www.elastic.co/guide/en/logstash/current/configuration-file-structure.html">Logstash pipeline</a> is composed of three main components:</p>
<ul>
<li>Input Plugins: Allow us to read data from different sources.</li>
<li>Filter Plugins: Allow us to transform and filter the data.</li>
<li>Output Plugins: Allow us to send the data to different destinations.</li>
</ul>
<p>Logstash also has a special input and a special output that allow pipeline-to-pipeline communication. We can consider this a similar concept to an OpenTelemetry connector.</p>
<h2>Logstash pipeline compared to Otel Collector components</h2>
<p>We can schematize how Logstash Pipeline and OTEL Collector pipeline components can relate to each other as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-pipeline-to-otel-pipeline.png" alt="logstash-pipeline-to-otel-pipeline" /></p>
<p>Enough theory! Let us dive into some examples.</p>
<h2>Convert a Logstash Pipeline into OpenTelemetry Collector Pipeline</h2>
<h3>Example 1: Parse and transform log line</h3>
<p>Let's consider the below line:</p>
<pre><code>2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404
</code></pre>
<p>We will apply the following steps:</p>
<ol>
<li>Read the line from the file <code>/tmp/demo-line.log</code>.</li>
<li>Define the output to be an Elasticsearch datastream <code>logs-access_log-default</code>.</li>
<li>Extract the <code>@timestamp</code>, <code>user.name</code>, <code>client.ip</code>, <code>client.port</code>, <code>url.path</code> and <code>http.status.code</code>.</li>
<li>Drop log messages related to the <code>SYSTEM</code> user.</li>
<li>Parse the date timestamp with the relevant date format and store it in <code>@timestamp</code>.</li>
<li>Add a description field <code>http.status.code_description</code> based on the known status codes.</li>
<li>Send data to Elasticsearch.</li>
</ol>
<p><strong>Logstash pipeline</strong></p>
<pre><code class="language-ruby">input {
    file {
        path =&gt; &quot;/tmp/demo-line.log&quot; #[1]
        start_position =&gt; &quot;beginning&quot;
        add_field =&gt; { #[2]
            &quot;[data_stream][type]&quot; =&gt; &quot;logs&quot;
            &quot;[data_stream][dataset]&quot; =&gt; &quot;access_log&quot;
            &quot;[data_stream][namespace]&quot; =&gt; &quot;default&quot;
        }
    }
}

filter {
    grok { #[3]
        match =&gt; {
            &quot;message&quot; =&gt; &quot;%{TIMESTAMP_ISO8601:[date]}: user %{WORD:[user][name]} accessed from %{IP:[client][ip]}:%{NUMBER:[client][port]:int} path %{URIPATH:[url][path]} with error %{NUMBER:[http][status][code]}&quot;
        }
    }
    if &quot;_grokparsefailure&quot; not in [tags] {
        if [user][name] == &quot;SYSTEM&quot; { #[4]
            drop {}
        }
        date { #[5]
            match =&gt; [&quot;[date]&quot;, &quot;ISO8601&quot;]
            target =&gt; &quot;[@timestamp]&quot;
            timezone =&gt; &quot;UTC&quot;
            remove_field =&gt; [ &quot;date&quot; ]
        }
        translate { #[6]
            source =&gt; &quot;[http][status][code]&quot;
            target =&gt; &quot;[http][status][code_description]&quot;
            dictionary =&gt; {
                &quot;200&quot; =&gt; &quot;OK&quot;
                &quot;403&quot; =&gt; &quot;Permission denied&quot;
                &quot;404&quot; =&gt; &quot;Not Found&quot;
                &quot;500&quot; =&gt; &quot;Server Error&quot;
            }
            fallback =&gt; &quot;Unknown error&quot;
        }
    }
}

output {
    elasticsearch { #[7]
        hosts =&gt; &quot;elasticsearch-endpoint:443&quot;
        api_key =&gt; &quot;${ES_API_KEY}&quot;
    }
}
</code></pre>
<p><strong>OpenTelemetry Collector configuration</strong></p>
<pre><code class="language-yaml">receivers:
  filelog: #[1]
    start_at: beginning
    include:
      - /tmp/demo-line.log
    include_file_name: false
    include_file_path: true
    storage: file_storage 
    operators:
    # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes[&quot;data_stream.type&quot;]
      value: &quot;logs&quot;
    - type: add #[2]
      field: attributes[&quot;data_stream.dataset&quot;]
      value: &quot;access_log_otel&quot; 
    - type: add #[2]
      field: attributes[&quot;data_stream.namespace&quot;]
      value: &quot;default&quot;

extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]
      resource_attributes:
        os.type:
          enabled: false

  transform/grok: #[3]
    log_statements:
      - context: log
        statements:
        - 'merge_maps(attributes, ExtractGrokPatterns(attributes[&quot;event.original&quot;], &quot;%{TIMESTAMP_ISO8601:date}: user %{WORD:user.name} accessed from %{IP:client.ip}:%{NUMBER:client.port:int} path %{URIPATH:url.path} with error %{NUMBER:http.status.code}&quot;, true), &quot;insert&quot;)'

  filter/exclude_system_user:  #[4]
    error_mode: ignore
    logs:
      log_record:
        - attributes[&quot;user.name&quot;] == &quot;SYSTEM&quot;

  transform/parse_date: #[5]
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes[&quot;date&quot;], &quot;%Y-%m-%dT%H:%M:%S&quot;))
          - delete_key(attributes, &quot;date&quot;)
        conditions:
          - attributes[&quot;date&quot;] != nil

  transform/translate_status_code:  #[6]
    log_statements:
      - context: log
        conditions:
        - attributes[&quot;http.status.code&quot;] != nil
        statements:
        - set(attributes[&quot;http.status.code_description&quot;], &quot;OK&quot;)                where attributes[&quot;http.status.code&quot;] == &quot;200&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Permission Denied&quot;) where attributes[&quot;http.status.code&quot;] == &quot;403&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Not Found&quot;)         where attributes[&quot;http.status.code&quot;] == &quot;404&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Server Error&quot;)      where attributes[&quot;http.status.code&quot;] == &quot;500&quot;
        - set(attributes[&quot;http.status.code_description&quot;], &quot;Unknown Error&quot;)     where attributes[&quot;http.status.code_description&quot;] == nil

exporters:
  elasticsearch: #[7]
    endpoints: [&quot;elasticsearch-endpoint:443&quot;]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - resourcedetection/system
        - transform/grok
        - filter/exclude_system_user
        - transform/parse_date
        - transform/translate_status_code
      exporters:
        - elasticsearch
</code></pre>
<p>These pipelines will generate the following document in Elasticsearch:</p>
<pre><code class="language-json">{
    &quot;@timestamp&quot;: &quot;2024-09-20T08:33:27.000Z&quot;,
    &quot;client&quot;: {
        &quot;ip&quot;: &quot;89.66.167.22&quot;,
        &quot;port&quot;: 10592
    },
    &quot;data_stream&quot;: {
        &quot;dataset&quot;: &quot;access_log&quot;,
        &quot;namespace&quot;: &quot;default&quot;,
        &quot;type&quot;: &quot;logs&quot;
    },
    &quot;event&quot;: {
        &quot;original&quot;: &quot;2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404&quot;
    },
    &quot;host&quot;: {
        &quot;hostname&quot;: &quot;my-laptop&quot;,
        &quot;name&quot;: &quot;my-laptop&quot;
    },
    &quot;http&quot;: {
        &quot;status&quot;: {
            &quot;code&quot;: &quot;404&quot;,
            &quot;code_description&quot;: &quot;Not Found&quot;
        }
    },
    &quot;log&quot;: {
        &quot;file&quot;: {
            &quot;path&quot;: &quot;/tmp/demo-line.log&quot;
        }
    },
    &quot;message&quot;: &quot;2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404&quot;,
    &quot;url&quot;: {
        &quot;path&quot;: &quot;/blog&quot;
    },
    &quot;user&quot;: {
        &quot;name&quot;: &quot;frank&quot;
    }
}
</code></pre>
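<p>To sanity-check the logic outside of either tool, steps 3 to 6 can be sketched in plain Python. This is only an illustration, not how Logstash or the Collector implement these steps: the regular expression is a hand-written stand-in for the grok pattern, and the names <code>LINE_RE</code>, <code>FIELDS</code>, and <code>process</code> are made up for this sketch.</p>
<pre><code class="language-python">import re
from datetime import datetime, timezone

# Plain-regex stand-in for the grok pattern used above (illustrative only).
LINE_RE = re.compile(
    r'^(\S+): user (\w+) accessed from ([\d.]+):(\d+) path (\S+) with error (\d+)$'
)
FIELDS = ('date', 'user.name', 'client.ip', 'client.port', 'url.path', 'http.status.code')
STATUS_DESCRIPTIONS = {
    '200': 'OK',
    '403': 'Permission Denied',
    '404': 'Not Found',
    '500': 'Server Error',
}


def process(line):
    '''Return the transformed event as a flat dict, or None if the event is dropped.'''
    event = {'event.original': line}
    match = LINE_RE.match(line)
    if match is None:
        # Parse failure: keep the raw line and skip the remaining steps.
        return event
    # Step 3: extract the fields and convert the port to an integer.
    event.update(zip(FIELDS, match.groups()))
    event['client.port'] = int(event['client.port'])
    # Step 4: drop events that belong to the SYSTEM user.
    if event['user.name'] == 'SYSTEM':
        return None
    # Step 5: parse the date (treated as UTC) and remove the temporary field.
    parsed = datetime.strptime(event.pop('date'), '%Y-%m-%dT%H:%M:%S')
    event['@timestamp'] = parsed.replace(tzinfo=timezone.utc).isoformat()
    # Step 6: translate the status code, with a fallback for unknown codes.
    event['http.status.code_description'] = STATUS_DESCRIPTIONS.get(
        event['http.status.code'], 'Unknown Error'
    )
    return event


sample = '2024-09-20T08:33:27: user frank accessed from 89.66.167.22:10592 path /blog with error 404'
print(process(sample))
</code></pre>
<p>Feeding a handful of representative lines through a sketch like this before touching either configuration makes it easier to agree on the expected fields, and to compare them against the documents that actually land in Elasticsearch.</p>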
<h3>Example 2: Parse and transform a NDJSON-formatted log file</h3>
<p>Let's consider the following JSON line:</p>
<pre><code class="language-json">{&quot;log_level&quot;:&quot;INFO&quot;,&quot;message&quot;:&quot;User login successful&quot;,&quot;service&quot;:&quot;auth-service&quot;,&quot;timestamp&quot;:&quot;2024-10-11 12:34:56.123 +0100&quot;,&quot;user&quot;:{&quot;id&quot;:&quot;A1230&quot;,&quot;name&quot;:&quot;john_doe&quot;}}
</code></pre>
<p>We will apply the following steps:</p>
<ol>
<li>Read a line from the file <code>/tmp/demo.ndjson</code>.</li>
<li>Define the output to be an Elasticsearch datastream <code>logs-json-default</code>.</li>
<li>Parse the JSON and assign relevant keys and values.</li>
<li>Parse the date.</li>
<li>Override the message field.</li>
<li>Rename fields to follow ECS conventions.</li>
<li>Send data to Elasticsearch.</li>
</ol>
<p><strong>Logstash pipeline</strong></p>
<pre><code class="language-ruby">input {
    file {
        path =&gt; &quot;/tmp/demo.ndjson&quot; #[1]
        start_position =&gt; &quot;beginning&quot;
        add_field =&gt; { #[2]
            &quot;[data_stream][type]&quot; =&gt; &quot;logs&quot;
            &quot;[data_stream][dataset]&quot; =&gt; &quot;json&quot;
            &quot;[data_stream][namespace]&quot; =&gt; &quot;default&quot;
        }
    }
}

filter {
  if [message] =~ /^\{.*/ {
    json { #[3] &amp; #[5]
        source =&gt; &quot;message&quot;
    }
  }
  date { #[4]
    match =&gt; [&quot;[timestamp]&quot;, &quot;yyyy-MM-dd HH:mm:ss.SSS Z&quot;]
    remove_field =&gt; &quot;[timestamp]&quot;
  }
  mutate {
    rename =&gt; { #[6]
      &quot;service&quot; =&gt; &quot;[service][name]&quot;
      &quot;log_level&quot; =&gt; &quot;[log][level]&quot;
    }
  }
}


output {
    elasticsearch { # [7]
        hosts =&gt; &quot;elasticsearch-endpoint:443&quot;
        api_key =&gt; &quot;${ES_API_KEY}&quot;
    }
}
</code></pre>
<p><strong>OpenTelemetry Collector configuration</strong></p>
<pre><code class="language-yaml">receivers:
  filelog/json: # [1]
    include: 
      - /tmp/demo.ndjson
    retry_on_failure:
      enabled: true
    start_at: beginning
    storage: file_storage 
    operators:
     # Copy the raw message into event.original (this is done OOTB by Logstash in ECS mode)
    - type: copy
      from: body
      to: attributes['event.original']
    - type: add #[2]
      field: attributes[&quot;data_stream.type&quot;]
      value: &quot;logs&quot;
    - type: add #[2]
      field: attributes[&quot;data_stream.dataset&quot;]
      value: &quot;otel&quot;
    - type: add #[2]
      field: attributes[&quot;data_stream.namespace&quot;]
      value: &quot;default&quot;


extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

processors:
  # Adding  host.name (this is done OOTB by Logstash)
  resourcedetection/system:
    detectors: [&quot;system&quot;]
    system:
      hostname_sources: [&quot;os&quot;]
      resource_attributes:
        os.type:
          enabled: false

  transform/json_parse:  #[3]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - merge_maps(attributes, ParseJSON(body), &quot;upsert&quot;)
        conditions: 
          - IsMatch(body, &quot;^\\{&quot;)
      

  transform/parse_date:  #[4]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(time, Time(attributes[&quot;timestamp&quot;], &quot;%Y-%m-%d %H:%M:%S.%L %z&quot;))
          - delete_key(attributes, &quot;timestamp&quot;)
        conditions: 
          - attributes[&quot;timestamp&quot;] != nil

  transform/override_message_field: # [5]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(body, attributes[&quot;message&quot;])
          - delete_key(attributes, &quot;message&quot;)

  transform/set_log_severity: # [6]
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(severity_text, attributes[&quot;log_level&quot;])          

  attributes/rename_attributes: #[6]
    actions:
      - key: service.name
        from_attribute: service
        action: insert
      - key: service
        action: delete
      - key: log_level
        action: delete

exporters:
  elasticsearch: #[7]
    endpoints: [&quot;elasticsearch-endpoint:443&quot;]
    api_key: ${env:ES_API_KEY}
    tls:
    logs_dynamic_index:
      enabled: true
    mapping:
      mode: ecs

service:
  extensions: [file_storage]
  pipelines:
    logs/json:
      receivers: 
        - filelog/json
      processors:
        - resourcedetection/system    
        - transform/json_parse
        - transform/parse_date        
        - transform/override_message_field
        - transform/set_log_severity
        - attributes/rename_attributes
      exporters: 
        - elasticsearch

</code></pre>
<p>These pipelines will generate the following document in Elasticsearch:</p>
<pre><code class="language-json">{
    &quot;@timestamp&quot;: &quot;2024-10-11T12:34:56.123000000Z&quot;,
    &quot;data_stream&quot;: {
        &quot;dataset&quot;: &quot;otel&quot;,
        &quot;namespace&quot;: &quot;default&quot;,
        &quot;type&quot;: &quot;logs&quot;
    },
    &quot;event&quot;: {
        &quot;original&quot;: &quot;{\&quot;log_level\&quot;:\&quot;INFO\&quot;,\&quot;message\&quot;:\&quot;User login successful\&quot;,\&quot;service\&quot;:\&quot;auth-service\&quot;,\&quot;timestamp\&quot;:\&quot;2024-10-11 12:34:56.123 +0100\&quot;,\&quot;user\&quot;:{\&quot;id\&quot;:\&quot;A1230\&quot;,\&quot;name\&quot;:\&quot;john_doe\&quot;}}&quot;
    },
    &quot;host&quot;: {
        &quot;hostname&quot;: &quot;my-laptop&quot;,
        &quot;name&quot;: &quot;my-laptop&quot;
    },
    &quot;log&quot;: {
        &quot;file&quot;: {
            &quot;name&quot;: &quot;demo.ndjson&quot;
        },
        &quot;level&quot;: &quot;INFO&quot;
    },
    &quot;message&quot;: &quot;User login successful&quot;,
    &quot;service&quot;: {
        &quot;name&quot;: &quot;auth-service&quot;
    },
    &quot;user&quot;: {
        &quot;id&quot;: &quot;A1230&quot;,
        &quot;name&quot;: &quot;john_doe&quot;
    }
}

</code></pre>
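<p>As with the first example, the JSON parsing, date parsing, message override, and field renaming steps can be sketched in plain Python to make the expected result concrete. This is only an illustration under assumptions: the function name <code>process</code> and the flat, dotted key names are made up for this sketch, and neither tool works this way internally.</p>
<pre><code class="language-python">import json
from datetime import datetime


def process(line):
    '''Parse one NDJSON line and reshape it roughly like the pipelines above.'''
    event = {'event.original': line, 'message': line}
    if not line.lstrip().startswith('{'):
        # Not JSON: keep the raw line as the message.
        return event
    # Steps 3 and 5: parse the JSON; its message key overrides the raw message.
    event.update(json.loads(line))
    # Step 4: %f accepts the millisecond fraction and %z the +0100 offset.
    parsed = datetime.strptime(event.pop('timestamp'), '%Y-%m-%d %H:%M:%S.%f %z')
    event['@timestamp'] = parsed.isoformat()
    # Step 6: rename fields to ECS-style names.
    event['service.name'] = event.pop('service')
    event['log.level'] = event.pop('log_level')
    return event


sample = json.dumps({
    'log_level': 'INFO',
    'message': 'User login successful',
    'service': 'auth-service',
    'timestamp': '2024-10-11 12:34:56.123 +0100',
    'user': {'id': 'A1230', 'name': 'john_doe'},
})
print(process(sample))
</code></pre>
<p>The sketch keeps the original UTC offset in <code>@timestamp</code> rather than normalizing it, so the value will not look identical to what either tool stores in Elasticsearch.</p>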
<h2>Conclusion</h2>
<p>In this post, we showed examples of how to convert a typical Logstash pipeline into an OpenTelemetry Collector pipeline for logs. While OpenTelemetry provides powerful tools for collecting and exporting logs, if your pipeline relies on complex transformations or scripting, Logstash remains a superior choice. This is because Logstash offers a broader range of built-in features and a more flexible approach to handling advanced data manipulation tasks.</p>
<h2>What's Next?</h2>
<p>Now that you've seen basic (but realistic) examples of converting a Logstash pipeline to OpenTelemetry, it's your turn to dive deeper. Depending on your needs, you can explore further and find more detailed resources in the following repositories:</p>
<ul>
<li><a href="https://github.com/open-telemetry/opentelemetry-collector">OpenTelemetry Collector</a>: Learn about the core OpenTelemetry components, from receivers to exporters.</li>
<li><a href="https://github.com/open-telemetry/opentelemetry-collector-contrib">OpenTelemetry Collector Contrib</a>: Find community-contributed components for a wider range of integrations and features.</li>
<li><a href="https://github.com/elastic/opentelemetry-collector-components">Elastic's opentelemetry-collector-components</a>: Dive into Elastic's extensions for the OpenTelemetry Collector, offering more tailored features for Elastic Stack users.</li>
</ul>
<p>If you encounter specific challenges or need to handle more advanced use cases, these repositories will be an excellent resource for discovering additional components or integrations that can enhance your pipeline. All these repositories have a similar structure with folders named <code>receiver</code>, <code>processor</code>, <code>exporter</code>, <code>connector</code>, which should be familiar after reading this blog. Whether you are migrating a simple Logstash pipeline or tackling more complex data transformations, these tools and communities will provide the support you need for a successful OpenTelemetry implementation.</p>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/logstash-to-otel/logstash-otel.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Migrating 1 billion log lines from OpenSearch to Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/migrating-billion-log-lines-opensearch-elasticsearch</link>
            <guid isPermaLink="false">migrating-billion-log-lines-opensearch-elasticsearch</guid>
            <pubDate>Wed, 11 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to migrate 1 billion log lines from OpenSearch to Elasticsearch for improved performance and reduced disk usage. Discover the migration strategies, data transfer methods, and optimization techniques used in this guide.]]></description>
            <content:encoded><![CDATA[<p>What are the current options to migrate from OpenSearch to Elasticsearch<sup>®</sup>?</p>
<p>OpenSearch is a fork of Elasticsearch 7.10 that has diverged quite a bit from Elasticsearch since then, resulting in a different set of features and also different performance, as <a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">this benchmark</a> shows (hint: it’s currently much slower than Elasticsearch).</p>
<p>Given the differences between the two solutions, restoring a snapshot from OpenSearch is not possible, nor is reindex-from-remote, so our only option is then using something in between that will read from OpenSearch and write to Elasticsearch.</p>
<p>This blog will show you how easy it is to migrate from OpenSearch to Elasticsearch for better performance and less disk usage!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/blog-elastic-348gb-disk-space-logs.jpg" alt="1 - arrows" /></p>
<h2>1 billion log lines</h2>
<p>We are going to use part of the data set we used for the benchmark, which takes about half a terabyte on disk, including replicas, and spans over a week (January 1–7, 2023).</p>
<p>We have in total 1,009,165,775 documents that take <strong>453.5GB</strong> of space in OpenSearch, including the replicas. That’s roughly <strong>241.2 bytes per document</strong> on the primary shards, which hold about half of the total since each index has one replica. This is going to be important later when we enable a couple of optimizations in Elasticsearch that will bring this total size way down without sacrificing performance!</p>
<p>This billion log line data set is spread over nine indices that are part of a datastream we are calling logs-myapplication-prod. We have primary shards of about 25GB in size, according to the best practices for optimal shard sizing. A <code>GET _cat/indices</code> shows us the indices we are dealing with:</p>
<pre><code class="language-bash">index                              docs.count pri rep pri.store.size store.size
.ds-logs-myapplication-prod-000049  102519334   1   1         22.1gb     44.2gb
.ds-logs-myapplication-prod-000048  114273539   1   1         26.1gb     52.3gb
.ds-logs-myapplication-prod-000044  111093596   1   1         25.4gb     50.8gb
.ds-logs-myapplication-prod-000043  113821016   1   1         25.7gb     51.5gb
.ds-logs-myapplication-prod-000042  113859174   1   1         24.8gb     49.7gb
.ds-logs-myapplication-prod-000041  112400019   1   1         25.7gb     51.4gb
.ds-logs-myapplication-prod-000040  113362823   1   1         25.9gb     51.9gb
.ds-logs-myapplication-prod-000038  110994116   1   1         25.3gb     50.7gb
.ds-logs-myapplication-prod-000037  116842158   1   1         25.4gb     50.8gb
</code></pre>
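<p>As a quick cross-check of these figures, the listing can be totalled in a few lines of Python. This sketch assumes that the <code>gb</code> values are binary units (1024<sup>3</sup> bytes) and uses the rounded sizes from the listing, so the per-document result is approximate.</p>
<pre><code class="language-python"># (docs.count, pri.store.size in GB) copied from the listing above.
INDICES = [
    (102519334, 22.1),
    (114273539, 26.1),
    (111093596, 25.4),
    (113821016, 25.7),
    (113859174, 24.8),
    (112400019, 25.7),
    (113362823, 25.9),
    (110994116, 25.3),
    (116842158, 25.4),
]

total_docs = sum(docs for docs, _ in INDICES)
primary_gb = sum(size for _, size in INDICES)
# Assumption: 1 GB here means 1024 ** 3 bytes.
bytes_per_doc = primary_gb * 1024 ** 3 / total_docs

print(f'{total_docs:,} documents')
print(f'{primary_gb:.1f}GB on primary shards')
print(f'{bytes_per_doc:.0f} bytes per document')
</code></pre>
<p>The document count adds up to exactly 1,009,165,775, and the primary shards come to a few hundred bytes per document, which is the baseline to keep in mind when comparing disk usage after the migration.</p>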
<p>Both OpenSearch and Elasticsearch clusters have the same configuration: 3 nodes with 64GB RAM and 12 CPU cores. Just like in the <a href="https://www.elastic.co/blog/elasticsearch-opensearch-performance-gap">benchmark</a>, the clusters are running in Kubernetes.</p>
<h2>Moving data from A to B</h2>
<p>Typically, moving data from one Elasticsearch cluster to another is as easy as a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html">snapshot and restore</a> if the clusters are on compatible versions, or a <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#reindex-from-remote">reindex from remote</a> if you need real-time synchronization and minimized downtime. These methods do not apply when migrating data from OpenSearch to Elasticsearch because the projects have significantly diverged from the 7.10 fork. However, there is one method that will work: scrolling.</p>
<h3>Scrolling</h3>
<p>Scrolling involves using an external tool, such as Logstash<sup>®</sup>, to read data from the source cluster and write it to the destination cluster. This method provides a high degree of customization, allowing us to transform the data during the migration process if needed. Here are a few advantages of using Logstash:</p>
<ul>
<li><strong>Easy parallelization:</strong> It’s really easy to write concurrent jobs that can read from different “slices” of the indices, essentially maximizing our throughput.</li>
<li><strong>Queuing:</strong> Logstash automatically queues documents before sending.</li>
<li><strong>Automatic retries:</strong> In the event of a failure or an error during data transmission, Logstash will automatically attempt to resend the data; moreover, it will stop querying the source cluster as often, until the connection is re-established, all without manual intervention.</li>
</ul>
<p>Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left, similar to how a “cursor” works in relational databases.</p>
<p>A <a href="https://www.elastic.co/guide/en/elasticsearch/guide/master/scroll.html">scrolled search</a> takes a snapshot in time by freezing the segments that make the index up until the time the request is made, preventing those segments from merging. As a result, the scroll doesn’t see any changes that are made to the index after the initial search request has been made.</p>
<h3>Migration strategies</h3>
<p>Reading from A and writing to B can be slow without optimization because it involves paginating through the results and transferring each batch over the network to Logstash, which assembles the documents into another batch and then transfers those batches over the network again to Elasticsearch, where the documents are indexed. So when it comes to such large data sets, we must be very efficient and extract every bit of performance where we can.</p>
<p>Let’s start with the facts: what do we know about the data we need to transfer? We have nine indices in the datastream, each with about 100 million documents. Let’s test with just one of the indices and measure the indexing rate to see how long it takes to migrate. The indexing rate can be seen by activating the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/monitoring-overview.html">monitoring</a> functionality in Elastic<sup>®</sup> and then navigating to the index you want to inspect.</p>
<p><strong>Scrolling in the deep</strong><br />
The simplest approach for transferring the log lines over would be to make Elasticsearch scroll over the entire data set and check it later when it finishes. Here we will introduce our first two variables: PAGE_SIZE and BATCH_SIZE. The former is how many records we are going to bring from the source every time we query it, and the latter is how many documents are going to be assembled together by Logstash and written to the destination index.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-2-scrolling-in-the-deep.jpg" alt="Deep scrolling" /></p>
<p>With such a large data set, the scroll slows down as this deep pagination progresses. The indexing rate starts at 6,000 docs/second and steadily drops to 700 docs/second because the pagination gets very deep. Without any optimization, it would take us 19 days (!) to migrate the 1 billion documents. We can do better than that!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-3-index-rate.png" alt="Indexing rate for a deep scroll" /></p>
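<p>To get a feel for how strongly the indexing rate drives the total migration time, here is a back-of-the-envelope sketch in Python. It assumes a constant rate for the whole run, which is a simplification because the measured rate keeps changing, so it brackets the duration rather than reproducing the measured estimate. The helper name <code>days_to_migrate</code> is made up for this sketch.</p>
<pre><code class="language-python">TOTAL_DOCS = 1_009_165_775
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_migrate(total_docs, docs_per_second):
    '''Days needed to move total_docs at a constant indexing rate.'''
    return total_docs / docs_per_second / SECONDS_PER_DAY


for rate in (6000, 700):
    print(f'{rate} docs/sec: {days_to_migrate(TOTAL_DOCS, rate):.1f} days')
</code></pre>
<p>At a constant 6,000 docs/second the whole data set would move in about two days, while at 700 docs/second it would take more than two weeks. Keeping the rate from collapsing matters more than the starting speed.</p>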
<p><strong>Slice me nice</strong><br />
We can optimize scrolling by using an approach called <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#slice-scroll">Sliced scroll</a>, where we split the index in different slices to consume them independently.</p>
<p>Here we will introduce our last two variables: SLICES and WORKERS. The number of slices cannot be too small, because each slice then has to paginate deeper and performance decreases drastically over time. It cannot be too large either, because the overhead of maintaining the scrolls would outweigh the benefits of smaller searches.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-4-slice-me-nice.jpg" alt="Sliced scroll" /></p>
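<p>To illustrate what a sliced scroll looks like at the request level, the following Python sketch builds one search body per slice. The <code>slice</code> object with <code>id</code> and <code>max</code> is part of the search API; the function name <code>sliced_scroll_bodies</code>, the <code>match_all</code> query, and the values of 4 slices and 500 documents per page are assumptions chosen for illustration. In this migration Logstash issues the requests, so the sketch only shows the idea.</p>
<pre><code class="language-python">import json


def sliced_scroll_bodies(slices, page_size):
    '''Build one search request body per slice of a sliced scroll.'''
    return [
        {
            'slice': {'id': slice_id, 'max': slices},
            'size': page_size,
            'query': {'match_all': {}},
        }
        for slice_id in range(slices)
    ]


for body in sliced_scroll_bodies(slices=4, page_size=500):
    print(json.dumps(body))
</code></pre>
<p>Each body would be sent as its own scrolled search against the same index, and each resulting scroll can then be consumed independently by a separate worker.</p>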
<p>Let’s start by migrating a single index (out of the nine we have) with different parameters to see what combination gives us the highest throughput.</p>
<table>
<thead>
<tr>
<th>SLICES</th>
<th>PAGE_SIZE</th>
<th>WORKERS</th>
<th>BATCH_SIZE</th>
<th>Average Indexing Rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>500</td>
<td>3</td>
<td>500</td>
<td>13,319 docs/sec</td>
</tr>
<tr>
<td>3</td>
<td>1,000</td>
<td>3</td>
<td>1,000</td>
<td>13,048 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>250</td>
<td>4</td>
<td>250</td>
<td>10,199 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>500</td>
<td>4</td>
<td>500</td>
<td>12,692 docs/sec</td>
</tr>
<tr>
<td>4</td>
<td>1,000</td>
<td>4</td>
<td>1,000</td>
<td>10,900 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>500</td>
<td>5</td>
<td>500</td>
<td>12,647 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>1,000</td>
<td>5</td>
<td>1,000</td>
<td>10,334 docs/sec</td>
</tr>
<tr>
<td>5</td>
<td>2,000</td>
<td>5</td>
<td>2,000</td>
<td>10,405 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>250</td>
<td>10</td>
<td>250</td>
<td>14,083 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>250</td>
<td>4</td>
<td>1,000</td>
<td>12,014 docs/sec</td>
</tr>
<tr>
<td>10</td>
<td>500</td>
<td>4</td>
<td>1,000</td>
<td>10,956 docs/sec</td>
</tr>
</tbody>
</table>
<p>It looks like we have a good set of candidates for maximizing the throughput of a single index, between 12K and 14K documents per second. That doesn't mean we have reached our ceiling. Even though search operations are single-threaded and every slice triggers sequential search operations to read data, nothing prevents us from reading several indices in parallel.</p>
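<p>A quick back-of-the-envelope calculation shows why parallelism matters. The sketch below assumes a constant indexing rate, which is a simplification, and the number of indices read in parallel is a hypothetical parameter; the rates themselves come from the measurements above.</p>
<pre><code class="language-python">def estimated_hours(total_docs, docs_per_second):
    '''Return the hours needed to move total_docs at a constant rate.'''
    return total_docs / docs_per_second / 3600


if __name__ == '__main__':
    total_docs = 1_000_000_000
    # Rates taken from the measurements above; parallel_indices is an
    # assumption about how many indices are read at the same time.
    for rate, parallel_indices in [(700, 1), (14_083, 1), (12_000, 9)]:
        hours = estimated_hours(total_docs, rate * parallel_indices)
        print(f'{rate:,} docs/sec x {parallel_indices}: {hours:.1f} hours')
</code></pre>
<p>Under these assumptions, a single index at about 14K docs/sec would still need close to 20 hours for 1 billion documents, while nine indices read in parallel at roughly 12K docs/sec each would finish in under three hours, provided the per-index rate holds up when they run side by side.</p>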
<p>By default, the maximum number of open scrolls is 500 — this limit can be updated with the search.max_open_scroll_context cluster setting, but the default value is enough for this particular migration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-5-index-rate-volatile.png" alt="5 - indexing rate" /></p>
<h2>Let’s migrate</h2>
<h3>Preparing our destination indices</h3>
<p>We are going to create a datastream called logs-myapplication-reindex to write the data to, but before indexing any data, let’s ensure our <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html">index template</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-index-lifecycle.html">index lifecycle management</a> configurations are properly set up. An index template acts as a blueprint for creating new indices, allowing you to define various settings that should be applied consistently across your indices.</p>
<p><strong>Index lifecycle management policy</strong><br />
Index lifecycle management (ILM) is equally vital, as it automates the management of indices throughout their lifecycle. With ILM, you can define policies that determine how long data should be retained, when it should be rolled over into new indices, and when old indices should be deleted or archived. Our policy is really straightforward:</p>
<pre><code class="language-bash">PUT _ilm/policy/logs-myapplication-lifecycle-policy
{
  &quot;policy&quot;: {
    &quot;phases&quot;: {
      &quot;hot&quot;: {
        &quot;actions&quot;: {
          &quot;rollover&quot;: {
            &quot;max_primary_shard_size&quot;: &quot;25gb&quot;
          }
        }
      },
      &quot;warm&quot;: {
        &quot;min_age&quot;: &quot;0d&quot;,
        &quot;actions&quot;: {
          &quot;forcemerge&quot;: {
            &quot;max_num_segments&quot;: 1
          }
        }
      }
    }
  }
}
</code></pre>
<p><strong>Index template (and saving 23% in disk space)</strong><br />
Since we are here, we’re going to go ahead and enable <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source">Synthetic Source</a>, a clever feature that allows us to discard the original JSON document instead of storing it, while still reconstructing it when needed from the stored fields.</p>
<p>For our example, enabling Synthetic Source resulted in a remarkable <strong>23.4% improvement in storage efficiency</strong>, reducing the size required to store a single document from 241.2 bytes in OpenSearch to just <strong>185 bytes</strong> in Elasticsearch.</p>
<p>Our full index template is therefore:</p>
<pre><code class="language-bash">PUT _index_template/logs-myapplication-reindex
{
  &quot;index_patterns&quot;: [
    &quot;logs-myapplication-reindex&quot;
  ],
  &quot;priority&quot;: 500,
  &quot;data_stream&quot;: {},
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index&quot;: {
        &quot;lifecycle.name&quot;: &quot;logs-myapplication-lifecycle-policy&quot;,
        &quot;codec&quot;: &quot;best_compression&quot;,
        &quot;number_of_shards&quot;: &quot;1&quot;,
        &quot;number_of_replicas&quot;: &quot;1&quot;,
        &quot;query&quot;: {
          &quot;default_field&quot;: [
            &quot;message&quot;
          ]
        }
      }
    },
    &quot;mappings&quot;: {
      &quot;_source&quot;: {
        &quot;mode&quot;: &quot;synthetic&quot;
      },
      &quot;_data_stream_timestamp&quot;: {
        &quot;enabled&quot;: true
      },
      &quot;date_detection&quot;: false,
      &quot;properties&quot;: {
        &quot;@timestamp&quot;: {
          &quot;type&quot;: &quot;date&quot;
        },
        &quot;agent&quot;: {
          &quot;properties&quot;: {
            &quot;ephemeral_id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;name&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;version&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;aws&quot;: {
          &quot;properties&quot;: {
            &quot;cloudwatch&quot;: {
              &quot;properties&quot;: {
                &quot;ingestion_time&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                },
                &quot;log_group&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                },
                &quot;log_stream&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                }
              }
            }
          }
        },
        &quot;cloud&quot;: {
          &quot;properties&quot;: {
            &quot;region&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;data_stream&quot;: {
          &quot;properties&quot;: {
            &quot;dataset&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;namespace&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;ecs&quot;: {
          &quot;properties&quot;: {
            &quot;version&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;event&quot;: {
          &quot;properties&quot;: {
            &quot;dataset&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;id&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            },
            &quot;ingested&quot;: {
              &quot;type&quot;: &quot;date&quot;
            }
          }
        },
        &quot;host&quot;: {
          &quot;type&quot;: &quot;object&quot;
        },
        &quot;input&quot;: {
          &quot;properties&quot;: {
            &quot;type&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;log&quot;: {
          &quot;properties&quot;: {
            &quot;file&quot;: {
              &quot;properties&quot;: {
                &quot;path&quot;: {
                  &quot;type&quot;: &quot;keyword&quot;,
                  &quot;ignore_above&quot;: 1024
                }
              }
            }
          }
        },
        &quot;message&quot;: {
          &quot;type&quot;: &quot;match_only_text&quot;
        },
        &quot;meta&quot;: {
          &quot;properties&quot;: {
            &quot;file&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;metrics&quot;: {
          &quot;properties&quot;: {
            &quot;size&quot;: {
              &quot;type&quot;: &quot;long&quot;
            },
            &quot;tmin&quot;: {
              &quot;type&quot;: &quot;long&quot;
            }
          }
        },
        &quot;process&quot;: {
          &quot;properties&quot;: {
            &quot;name&quot;: {
              &quot;type&quot;: &quot;keyword&quot;,
              &quot;ignore_above&quot;: 1024
            }
          }
        },
        &quot;tags&quot;: {
          &quot;type&quot;: &quot;keyword&quot;,
          &quot;ignore_above&quot;: 1024
        }
      }
    }
  }
}
</code></pre>
<h3>Building a custom Logstash image</h3>
<p>We are going to use a containerized Logstash for this migration because both clusters are sitting on a Kubernetes infrastructure, so it's easier to just spin up a Pod that will communicate to both clusters.</p>
<p>Since OpenSearch is not an official Logstash input, we must build a custom Logstash image that contains the logstash-input-opensearch plugin. Let’s use the base image from docker.elastic.co/logstash/logstash:9.3.2 and just install the plugin:</p>
<pre><code class="language-dockerfile">FROM docker.elastic.co/logstash/logstash:9.3.2

USER logstash
WORKDIR /usr/share/logstash
RUN bin/logstash-plugin install logstash-input-opensearch
</code></pre>
<h3>Writing a Logstash pipeline</h3>
<p>Now we have our Logstash Docker image, and we need to write a pipeline that will read from OpenSearch and write to Elasticsearch.</p>
<p><strong>The</strong> <strong>input</strong></p>
<pre><code class="language-ruby">input {
    opensearch {
        hosts =&gt; [&quot;os-cluster:9200&quot;]
        ssl =&gt; true
        ca_file =&gt; &quot;/etc/logstash/certificates/opensearch-ca.crt&quot;
        user =&gt; &quot;${OPENSEARCH_USERNAME}&quot;
        password =&gt; &quot;${OPENSEARCH_PASSWORD}&quot;
        index =&gt; &quot;${SOURCE_INDEX_NAME}&quot;
        slices =&gt; &quot;${SOURCE_SLICES}&quot;
        size =&gt; &quot;${SOURCE_PAGE_SIZE}&quot;
        scroll =&gt; &quot;5m&quot;
        docinfo =&gt; true
        docinfo_target =&gt; &quot;[@metadata][doc]&quot;
    }
}
</code></pre>
<p>Let’s break down the most important input parameters. The values are all represented as environment variables here:</p>
<ul>
<li><strong>hosts:</strong> Specifies the host and port of the OpenSearch cluster. In this case, it’s connecting to “os-cluster” on port 9200.</li>
<li><strong>index:</strong> Specifies the index in the OpenSearch cluster from which to retrieve logs. In this case, it’s “logs-myapplication-prod” which is a datastream that contains the actual indices (e.g., .ds-logs-myapplication-prod-000049).</li>
<li><strong>size:</strong> Specifies the maximum number of logs to retrieve in each request.</li>
<li><strong>scroll:</strong> Defines how long a search context will be kept open on the OpenSearch server. In this case, it’s set to “5m,” which means each request must be answered and a new “page” asked within five minutes.</li>
<li><strong>docinfo</strong> and <strong>docinfo_target:</strong> These settings control whether document metadata should be included in the Logstash output and where it should be stored. In this case, document metadata is being stored in the [@metadata][doc] field — this is important because the document’s _id will be used as the destination id as well.</li>
</ul>
<p>The ssl and ca_file are highly recommended if you are migrating from clusters that are in a different infrastructure (separate cloud providers). You don’t need to specify a ca_file if your TLS certificates are signed by a public authority, which is likely the case if you are using a SaaS and your endpoint is reachable over the internet. In this case, only ssl =&gt; true would suffice. In our case, all our TLS certificates are self-signed, so we must also provide the Certificate Authority (CA) certificate.</p>
<p><strong>The (optional)</strong> <strong>filter</strong><br />
We could use this to drop or alter the documents to be written to Elasticsearch if we wanted, but we are not going to, as we want to migrate the documents as is. We are only removing extra metadata fields that Logstash includes in all documents, such as &quot;@version&quot; and &quot;host&quot;. We are also removing the original &quot;data_stream&quot; as it contains the source data stream name, which might not be the same in the destination.</p>
<pre><code class="language-ruby">filter {
    mutate {
        remove_field =&gt; [&quot;@version&quot;, &quot;host&quot;, &quot;data_stream&quot;]
    }
}
</code></pre>
<p><strong>The</strong> <strong>output</strong><br />
The output is really simple: we name our datastream logs-myapplication-reindex and reuse the id of each original document as document_id, to ensure there are no duplicate documents. In Elasticsearch, datastream names follow the convention &lt;type&gt;-&lt;dataset&gt;-&lt;namespace&gt;, so our logs-myapplication-reindex datastream has “myapplication” as dataset and “reindex” as namespace.</p>
<pre><code class="language-ruby">output {
    elasticsearch {
        hosts =&gt; &quot;${ELASTICSEARCH_HOST}&quot;

        user =&gt; &quot;${ELASTICSEARCH_USERNAME}&quot;
        password =&gt; &quot;${ELASTICSEARCH_PASSWORD}&quot;

        # Reuse the source document id so re-runs do not create duplicates
        document_id =&gt; &quot;%{[@metadata][doc][_id]}&quot;

        # type-dataset-namespace gives logs-myapplication-reindex
        data_stream =&gt; &quot;true&quot;
        data_stream_type =&gt; &quot;logs&quot;
        data_stream_dataset =&gt; &quot;myapplication&quot;
        data_stream_namespace =&gt; &quot;reindex&quot;
    }
}
</code></pre>
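<p>The naming convention is easy to check with a few lines of Python. This is only an illustration of how the three parts combine; both helper functions are hypothetical and not part of Logstash or Elasticsearch.</p>
<pre><code class="language-python">def data_stream_name(stream_type, dataset, namespace):
    '''Compose a data stream name as type-dataset-namespace.'''
    return '-'.join([stream_type, dataset, namespace])


def split_data_stream_name(name):
    '''Split a name back into its three parts.

    Assumes the dataset itself contains no hyphen, which holds for the
    names used in this post.
    '''
    stream_type, dataset, namespace = name.split('-', 2)
    return {'type': stream_type, 'dataset': dataset, 'namespace': namespace}


if __name__ == '__main__':
    name = data_stream_name('logs', 'myapplication', 'reindex')
    print(name)
    print(split_data_stream_name(name))
</code></pre>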
<h3>Deploying Logstash</h3>
<p>We have a few options to deploy Logstash: it can be deployed <a href="https://www.elastic.co/guide/en/logstash/current/running-logstash-command-line.html">locally from the command line</a>, as a <a href="https://www.elastic.co/guide/en/logstash/current/running-logstash.html">systemd service</a>, via <a href="https://www.elastic.co/guide/en/logstash/current/docker.html">docker</a>, or on <a href="https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-logstash.html">Kubernetes</a>.</p>
<p>Since both of our clusters are deployed in a Kubernetes environment, we are going to deploy Logstash as a <strong>Pod</strong> referencing our Docker image created earlier. Let’s put our pipeline inside a <strong>ConfigMap</strong> along with some configuration files (pipelines.yml and logstash.yml).</p>
<p>In the below configuration, we have SOURCE_INDEX_NAME, SOURCE_SLICES, SOURCE_PAGE_SIZE, LOGSTASH_WORKERS, and LOGSTASH_BATCH_SIZE conveniently exposed as environment variables so you just need to fill them out.</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: logstash-1
spec:
  containers:
    - name: logstash
      image: ugosan/logstash-opensearch-input:8.10.0
      imagePullPolicy: Always
      env:
        - name: SOURCE_INDEX_NAME
          value: &quot;.ds-logs-benchmark-dev-000037&quot;
        - name: SOURCE_SLICES
          value: &quot;10&quot;
        - name: SOURCE_PAGE_SIZE
          value: &quot;500&quot;
        - name: LOGSTASH_WORKERS
          value: &quot;4&quot;
        - name: LOGSTASH_BATCH_SIZE
          value: &quot;1000&quot;
        - name: OPENSEARCH_USERNAME
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: username
        - name: OPENSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: os-cluster-admin-password
              key: password
        - name: ELASTICSEARCH_USERNAME
          value: &quot;elastic&quot;
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: es-cluster-es-elastic-user
              key: elastic
      resources:
        limits:
          memory: &quot;4Gi&quot;
          cpu: &quot;2500m&quot;
        requests:
          memory: &quot;1Gi&quot;
          cpu: &quot;300m&quot;
      volumeMounts:
        - name: config-volume
          mountPath: /usr/share/logstash/config
        - name: etc
          mountPath: /etc/logstash
          readOnly: true
  volumes:
    - name: config-volume
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipelines.yml
                  path: pipelines.yml
                - key: logstash.yml
                  path: logstash.yml
    - name: etc
      projected:
        sources:
          - configMap:
              name: logstash-configmap
              items:
                - key: pipeline.conf
                  path: pipelines/pipeline.conf
          - secret:
              name: os-cluster-http-cert
              items:
                - key: ca.crt
                  path: certificates/opensearch-ca.crt
          - secret:
              name: es-cluster-es-http-ca-internal
              items:
                - key: tls.crt
                  path: certificates/elasticsearch-ca.crt
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logstash-configmap
data:
  pipelines.yml: |
    - pipeline.id: reindex-os-es
      path.config: &quot;/etc/logstash/pipelines/pipeline.conf&quot;
      pipeline.batch.size: ${LOGSTASH_BATCH_SIZE}
      pipeline.workers: ${LOGSTASH_WORKERS}
  logstash.yml: |
    log.level: info
    pipeline.unsafe_shutdown: true
    pipeline.ordered: false
  pipeline.conf: |
    input {
        opensearch {
          hosts =&gt; [&quot;os-cluster:9200&quot;]
          ssl =&gt; true
          ca_file =&gt; &quot;/etc/logstash/certificates/opensearch-ca.crt&quot;
          user =&gt; &quot;${OPENSEARCH_USERNAME}&quot;
          password =&gt; &quot;${OPENSEARCH_PASSWORD}&quot;
          index =&gt; &quot;${SOURCE_INDEX_NAME}&quot;
          slices =&gt; &quot;${SOURCE_SLICES}&quot;
          size =&gt; &quot;${SOURCE_PAGE_SIZE}&quot;
          scroll =&gt; &quot;5m&quot;
          docinfo =&gt; true
          docinfo_target =&gt; &quot;[@metadata][doc]&quot;
        }
    }

    filter {
        mutate {
            remove_field =&gt; [&quot;@version&quot;, &quot;host&quot;, &quot;data_stream&quot;]
        }
    }

    output {
        elasticsearch {
            hosts =&gt; &quot;https://es-cluster-es-http:9200&quot;
            ssl =&gt; true
            ssl_certificate_authorities =&gt; [&quot;/etc/logstash/certificates/elasticsearch-ca.crt&quot;]
            ssl_verification_mode =&gt; &quot;full&quot;

            user =&gt; &quot;${ELASTICSEARCH_USERNAME}&quot;
            password =&gt; &quot;${ELASTICSEARCH_PASSWORD}&quot;

            document_id =&gt; &quot;%{[@metadata][doc][_id]}&quot;

            data_stream =&gt; &quot;true&quot;
            data_stream_type =&gt; &quot;logs&quot;
            data_stream_dataset =&gt; &quot;myapplication&quot;
            data_stream_namespace =&gt; &quot;reindex&quot;
        }
    }
</code></pre>
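<p>Because several indices can be read in parallel, one option is to start one Pod per source index, each with its own SOURCE_INDEX_NAME. The Python sketch below only generates the per-Pod values; the one-Pod-per-index layout and the pod_settings helper are assumptions for illustration, while the index names are the ones listed in the results table of this post.</p>
<pre><code class="language-python"># Source backing indices listed in the results table of this post.
SOURCE_INDICES = [
    '.ds-logs-myapplication-prod-0000' + suffix
    for suffix in ['37', '38', '40', '41', '42', '43', '44', '48', '49']
]


def pod_settings(index_names, slices, page_size, workers, batch_size):
    '''Return one dict of env values per source index.

    Pod names follow the logstash-N pattern of the manifest; running
    one Pod per index is an assumption, not something the manifest
    enforces.
    '''
    settings = []
    for number, index_name in enumerate(index_names, start=1):
        settings.append({
            'pod_name': f'logstash-{number}',
            'env': {
                'SOURCE_INDEX_NAME': index_name,
                'SOURCE_SLICES': str(slices),
                'SOURCE_PAGE_SIZE': str(page_size),
                'LOGSTASH_WORKERS': str(workers),
                'LOGSTASH_BATCH_SIZE': str(batch_size),
            },
        })
    return settings


if __name__ == '__main__':
    for item in pod_settings(SOURCE_INDICES, 10, 500, 4, 1000):
        print(item['pod_name'], item['env']['SOURCE_INDEX_NAME'])
</code></pre>
<p>The generated values could then be filled into copies of the Pod manifest above, by hand or with a templating tool of your choice.</p>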
<h2>That’s it.</h2>
<p>After a couple of hours, we had successfully migrated 1 billion documents from OpenSearch to Elasticsearch and even saved roughly 23% on disk storage! Now that we have the logs in Elasticsearch, how about extracting actual business value from them? Logs contain so much valuable information: we can not only do all sorts of interesting things with AIOps, like <a href="https://www.elastic.co/guide/en/observability/current/categorize-logs.html#analyze-log-categories">automatically categorizing</a> those logs, but also extract <a href="https://www.youtube.com/watch?v=0E7isxR_FzY&amp;list=PLzPXmNbs8vqUc2bROb1E2gNyj2GynRB5b&amp;index=3&amp;t=1122s">business metrics</a> and <a href="https://www.youtube.com/watch?v=0E7isxR_FzY&amp;list=PLzPXmNbs8vqUc2bROb1E2gNyj2GynRB5b&amp;index=3&amp;t=1906s">detect anomalies</a> in them. Give it a try.</p>
<table>
<thead>
<tr>
<th colspan="3">OpenSearch</th>
<th colspan="3">Elasticsearch</th>
<th></th>
</tr>
<tr>
<th>Index</th>
<th>docs</th>
<th>size</th>
<th>Index</th>
<th>docs</th>
<th>size</th>
<th>Diff.</th>
</tr>
</thead>
<tbody>
<tr>
<td>.ds-logs-myapplication-prod-000037</td>
<td>116842158</td>
<td>27285520870</td>
<td>logs-myapplication-reindex-000037</td>
<td>116842158</td>
<td>21998435329</td>
<td>21.46%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000038</td>
<td>110994116</td>
<td>27263291740</td>
<td>logs-myapplication-reindex-000038</td>
<td>110994116</td>
<td>21540011082</td>
<td>23.45%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000040</td>
<td>113362823</td>
<td>27872438186</td>
<td>logs-myapplication-reindex-000040</td>
<td>113362823</td>
<td>22234641932</td>
<td>22.50%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000041</td>
<td>112400019</td>
<td>27618801653</td>
<td>logs-myapplication-reindex-000041</td>
<td>112400019</td>
<td>22059453868</td>
<td>22.38%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000042</td>
<td>113859174</td>
<td>26686723701</td>
<td>logs-myapplication-reindex-000042</td>
<td>113859174</td>
<td>21093766108</td>
<td>23.41%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000043</td>
<td>113821016</td>
<td>27657006598</td>
<td>logs-myapplication-reindex-000043</td>
<td>113821016</td>
<td>22059454752</td>
<td>22.52%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000044</td>
<td>111093596</td>
<td>27281936915</td>
<td>logs-myapplication-reindex-000044</td>
<td>111093596</td>
<td>21559513422</td>
<td>23.43%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000048</td>
<td>114273539</td>
<td>28111420495</td>
<td>logs-myapplication-reindex-000048</td>
<td>114273539</td>
<td>22264398939</td>
<td>23.21%</td>
</tr>
<tr>
<td>.ds-logs-myapplication-prod-000049</td>
<td>102519334</td>
<td>23731274338</td>
<td>logs-myapplication-reindex-000049</td>
<td>102519334</td>
<td>19307250001</td>
<td>20.56%</td>
</tr>
</tbody>
</table>
<p>Interested in trying Elasticsearch? <a href="https://cloud.elastic.co/registration?elektra=en-cloud-page">Start our 14-day free trial</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/migrating-billion-log-lines-opensearch-elasticsearch/elastic-blog-header-1-billion-log-lines.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AIOps with Elastic Observability: Modern AIOps & Log Intelligence]]></title>
            <link>https://www.elastic.co/observability-labs/blog/modern-aiops-elastic-observability</link>
            <guid isPermaLink="false">modern-aiops-elastic-observability</guid>
            <pubDate>Wed, 26 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Exploring modern AIOps capabilities, including anomaly detection, log intelligence, and log analysis & categorization with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<h1>AIOps Blog Refresher: Unlocking Intelligence from Your Logs with Elastic</h1>
<p>Elastic has been leading the charge with AIOps, especially in the recent 9.2 update of Elastic Observability with Streams. The conversation around AIOps has shifted dramatically as we move through the year. DevOps and SRE teams aren't asking whether they need AIOps; they're asking how to leverage it more effectively to stay ahead of exponentially growing complexity.</p>
<p>The current challenge of AIOps is that modern cloud-native environments generate volumes of telemetry data that are orders of magnitude larger than those of past environments. But here's what many teams overlook: logs are the richest source of operational intelligence you have. Logs can tell you exactly what happened and why, while metrics only tell you that something is wrong, and traces only tell you where. The problem is that most organizations are drowning in logs. Microservices (such as user authentication or inventory services), serverless functions, and Kubernetes generate millions of log entries daily. Without AI and machine learning, finding meaningful patterns in this data takes too much time and energy.</p>
<h2>Log Intelligence Improvement: What's New in 2025</h2>
<p>Historically in observability, unlocking your log intelligence required lengthy manual effort: not only parsing through logs, but also structuring them. Elastic Observability has drastically changed how teams extract value from logs. Observability is more than simple signal analysis; modern tools need to support proactive, log-driven investigations. At Elastic, that modern approach is Streams.</p>
<p>Streams, a new release from Elastic, is a collection of AI-driven tools that identify significant events in parsed raw logs by enriching logs with meaningful fields. With Streams, SREs can maximize the value of their data, their logs, and their systems. With system reliability as the goal, Streams helps to reduce pipeline management overhead and accelerates observability analysis. And it takes nearly no time to set up!</p>
<p>Here is how Streams powers the Elastic Observability capabilities available now.</p>
<h3>Advanced Log Rate Analysis</h3>
<p>Log rate analysis can go far beyond only detecting spikes. Elastic's machine learning automatically identifies when log volumes deviate from expected baselines, then contextualizes these changes within your broader system performance. When your application suddenly generates more error logs, Elastic’s AIOps doesn't just alert you, it also determines whether it's a critical issue requiring immediate attention or just a temporary anomaly.</p>
<p>This matters to your analysis because not all log spikes are equal. A 10x increase in DEBUG logs might indicate verbose logging accidentally enabled in production. A 2x increase in ERROR logs could signal a cascading failure. Log rate analysis distinguishes between these scenarios automatically, giving your team the context needed to respond appropriately.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/log-analysis.png" alt="Log Analysis" /></p>
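<p>As a toy illustration of why the log level matters when reading a spike, the Python sketch below compares per-level counts between a baseline window and the current one. It is not how Elastic's machine learning works; the function name and the sample data are made up for the example.</p>
<pre><code class="language-python">from collections import Counter


def rate_changes(baseline_levels, current_levels):
    '''Return the current/baseline count ratio for each log level.

    Levels that never appear in the baseline are skipped, since a
    ratio cannot be computed for them.
    '''
    baseline = Counter(baseline_levels)
    current = Counter(current_levels)
    return {
        level: current[level] / count
        for level, count in baseline.items()
    }


if __name__ == '__main__':
    # Made-up windows of log levels, one entry per log line.
    baseline = ['DEBUG'] * 100 + ['ERROR'] * 5
    current = ['DEBUG'] * 1000 + ['ERROR'] * 10
    for level, ratio in rate_changes(baseline, current).items():
        print(f'{level}: {ratio:.0f}x the baseline volume')
</code></pre>
<p>The raw ratios alone would rank the DEBUG spike as the bigger event, which is exactly why additional context is needed to decide which change deserves attention.</p>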
<h3>Intelligent Log Categorization with Streams</h3>
<p>This is where AIOps shines with log data. Streams uses machine learning algorithms to automatically classify and group similar log patterns, dramatically reducing noise. Instead of manually parsing millions of entries, the system identifies common structures, groups related events, and surfaces the categories that matter most.</p>
<p>Logs are unstructured by nature, making them difficult to analyze at scale. Streams corrals chaotic log streams into organized, queryable patterns. Instantly, you can see that 80% of your errors fall into three categories, helping you prioritize where to focus remediation efforts. This approach helps you reduce noise and accelerate analysis, allowing teams to act on insights faster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/categories.png" alt="Log Categorizations" /></p>
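<p>The general idea behind grouping similar log lines can be sketched in a few lines of Python: mask the parts of a message that vary (here, only numbers and dotted numbers such as IP addresses) and count the resulting templates. This is a deliberately naive stand-in, not the algorithm Streams uses, and the sample messages are invented.</p>
<pre><code class="language-python">import re
from collections import Counter

# Numeric tokens, including dotted ones such as versions or IPv4 addresses.
VARIABLE_TOKEN = re.compile(r'\b\d+(?:\.\d+)*\b')


def template_of(message):
    '''Replace numeric tokens with a placeholder to get a template.'''
    return VARIABLE_TOKEN.sub('*', message)


def categorize(messages):
    '''Count how many messages share each template.'''
    return Counter(template_of(message) for message in messages)


if __name__ == '__main__':
    # Invented sample messages.
    sample = [
        'Connection timeout after 30 ms to 10.0.0.5',
        'Connection timeout after 45 ms to 10.0.0.7',
        'User 1001 logged in',
    ]
    for template, count in categorize(sample).most_common():
        print(count, template)
</code></pre>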
<h3>Multi-Dimensional Anomaly Detection</h3>
<p><a href="https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection">Anomaly detection</a> now simultaneously examines relationships between logs, metrics, and traces. A slight increase in response time might not trigger an alert by itself, but when correlated with unusual log patterns and memory consumption changes, the system recognizes it as an early warning sign.</p>
<p>Logs contain a myriad of contextual information that metrics and traces can't capture: stack traces, user IDs, transaction details, error messages, etc. By correlating log anomalies with other signals, you get the full picture of what's happening in your system. This holistic view enables teams to catch issues earlier and to understand their full impact across the stack.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/anomalies.png" alt="Anomaly Detection" /></p>
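<p>To see how several weak signals can add up, consider this simplified Python sketch. It scores each signal with a z-score against its own history and sums the scores; both the scoring method and the sample numbers are assumptions chosen for illustration, not a description of Elastic's anomaly detection.</p>
<pre><code class="language-python">from statistics import mean, pstdev


def z_score(history, value):
    '''Return how many standard deviations value is from the history mean.'''
    spread = pstdev(history)
    return 0.0 if spread == 0 else (value - mean(history)) / spread


def combined_score(signals):
    '''Sum the z-scores of several signals observed at the same time.

    signals maps a signal name to a (history, latest value) pair.
    '''
    return sum(z_score(history, value) for history, value in signals.values())


if __name__ == '__main__':
    # Invented histories and latest readings for three signals.
    signals = {
        'response_time_ms': ([100, 102, 98, 100], 103),
        'error_log_count': ([10, 12, 8, 10], 13),
        'memory_percent': ([50, 52, 48, 50], 53),
    }
    for name, (history, value) in signals.items():
        print(f'{name}: z = {z_score(history, value):.2f}')
    print(f'combined: {combined_score(signals):.2f}')
</code></pre>
<p>With these numbers, each signal sits a little over two standard deviations from its mean, which a per-signal threshold of three would ignore, while the combined score is more than twice that threshold.</p>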
<h3>Enhanced Root Cause Analysis Powered by Significant Events</h3>
<p>When an issue occurs, Elastic's Streams accelerates root cause analysis through AI-assisted log parsing and by surfacing <a href="https://www.elastic.co/docs/solutions/observability/streams/management/significant-events">“Significant events.”</a> Significant event queries can be defined by AI or manually, depending on whether you already know which logs you are looking for. Then, Elastic’s AIOps traces the problem through your entire stack using these events, as well as enriched log data combined with distributed tracing. The system can correlate failed transactions with specific log entries, deployment events, and infrastructure changes. This helps you understand not just what broke, but why and when.</p>
<p>Streams makes the analysis of your logs quick and automatic by going across your entire distributed system within seconds, grabbing relevant log entries such as stack traces, state information, error messages, and more. What used to require hours of manual investigation and deduction now happens automatically, freeing you and your team from tedious detective work and enabling faster resolution. </p>
<h2>Logs in Action: Real-World Impact</h2>
<p>Let's look at how these capabilities work together in practice. Imagine your payment processing service is experiencing intermittent failures - only 0.5% of transactions, but enough to concern your team. Traditional monitoring shows everything is mostly okay, but customers are still complaining.</p>
<p>Without Streams, an SRE might initially run some broad queries, manually sift through thousands of logs, struggle to connect all the dots, and ultimately not understand the correlation between the errors and recent system changes. </p>
<p>With Elastic Streams and AIOps, many of these potential problems are instantly mitigated:</p>
<ul>
<li>
<p>Streams automatically parses the payment service logs, adding connection timeouts to a new category of significant events</p>
</li>
<li>
<p>Log rate analysis with Streams reveals that this significant event category has been slowly growing over the past month, with the timeouts rising from a handful of occurrences to a much larger number</p>
</li>
<li>
<p>Elastic’s built-in anomaly detection correlates these significant events with deployment data, and identifies that they started appearing after a recent load balancer configuration change</p>
</li>
<li>
<p>Root cause analysis pinpoints the exact database connection pool setting that is too restrictive for peak load by tracing affected transactions through previously enriched logs</p>
</li>
</ul>
<p>What usually takes 4-8 hours of manual log analysis is resolved in minutes, with Elastic automatically highlighting the relevant log entries that tell the complete story. This is the power of AIOps and Streams as applied to log intelligence.</p>
<h2>The Power of Unified Log Intelligence</h2>
<p>What sets Elastic apart is treating logs as a priority in your observability strategy. Elastic provides comprehensive log ingestion that centralizes petabytes of logs from across your infrastructure with flexible parsing and enrichment. The platform uses purpose-built machine learning models that understand log patterns, not generic algorithms retrofitted for log analysis.</p>
<p>Logs don't exist in isolation, which is why Elastic correlates log data with metrics, traces, and business events to provide complete context. And because log volumes can be massive, Elastic's tiered storage approach means you can retain years of logs for compliance and historical analysis without breaking the budget.</p>
<h2>Why Logs Matter More Than Ever</h2>
<p>Logs have become the cornerstone of effective AIOps for three critical reasons.</p>
<p>First off, logs capture what metrics can't. A metric tells you the CPU is at 80%, but a log tells you which process is consuming resources and why. This level of detail is essential for understanding not just that something is wrong, but what specifically is causing the problem.</p>
<p>Second, logs provide business context. Error messages contain user IDs, transaction details, and business logic failures that help you understand customer impact. When you're troubleshooting an issue, knowing which customers are affected and what they were trying to do is invaluable for prioritizing your response.</p>
<p>Third, logs enable true root cause analysis. Stack traces, error messages, and application state captured in logs are essential for understanding the why behind every incident. Without this information, teams are left guessing at root causes rather than definitively identifying and fixing them.</p>
<p>The teams winning with AIOps in 2025 aren't just monitoring metrics; they're extracting intelligence from their logs at scale, turning operational data into actionable insights.</p>
<h2>Transform Your Log Strategy Today</h2>
<p>Every hour your team spends manually searching through logs is an hour they're not spending on innovation. Every incident that could have been prevented through intelligent log analysis represents both technical debt and business risk.</p>
<p>Elastic Observability provides the foundation you need to unlock the intelligence hidden in your logs. With automatic categorization, anomaly detection, and ML-powered analysis, you can start seeing value immediately. Check out this recent <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">article</a> to get started with Elastic Streams and Observability today!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/modern-aiops-elastic-observability/blog-header.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The observability gap: Why your monitoring strategy isn't ready for what's coming next]]></title>
            <link>https://www.elastic.co/observability-labs/blog/modern-observability-opentelemetry-correlation-ai</link>
            <guid isPermaLink="false">modern-observability-opentelemetry-correlation-ai</guid>
            <pubDate>Mon, 25 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[The increasing complexity of distributed applications and the observability data they generate creates challenges for SREs and IT Operations teams. Take a look at how you can close this observability gap with OpenTelemetry and the right strategy.]]></description>
            <content:encoded><![CDATA[<p>Anyone that’s been to London knows the announcements at the Tube to “Mind the gap” but what about the gap that’s developing in our monitoring and observability strategies? I’ve been through this toil before, and have run a distributed system that was humming along perfectly. My alerts were manageable, my dashboards made sense, and when things broke, I could usually track down the issue in a reasonable amount of time.</p>
<p>Fast forward 3-5 years and things have changed. We added Kubernetes, embraced microservices, and maybe even sprinkled in some AI-powered features. Suddenly, you're drowning in telemetry data, your alert fatigue is real, and correlating issues across your distributed architecture feels stressful.</p>
<p>You're experiencing what I call the &quot;observability gap&quot;, where system complexity rockets ahead while our monitoring maturity crawls behind. Today, we're going to explore why this gap exists, what's driving it wider, and most importantly, how to close it using modern observability practices.</p>
<h2>The complexity rocket ship has left the station</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image2.jpg" alt="Observability Gap" /></p>
<p>Let's be honest about what we're dealing with. The scale and complexity of our infrastructure aren't growing linearly; they're growing exponentially. We've gone from monolithic applications running on physical servers to container orchestration platforms managing hundreds of microservices, with AI algorithms now starting to make scaling decisions autonomously.</p>
<p>This trajectory shows no signs of slowing down. With AI-assisted coding accelerating development cycles and intelligent orchestration systems like Kubernetes evolving toward predictive scaling, we're looking at infrastructure that's not just complex, but dynamically complex.</p>
<p>Meanwhile, our observability tooling? It's stuck in the past, designed for a world where you knew exactly how many servers you had and could manually correlate logs with metrics by cross-referencing timestamps.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image3.jpg" alt="Observability Gap part 2" /></p>
<h2>The telemetry data explosion (and why sampling isn't the answer)</h2>
<p>One of the first things teams notice as they scale is their observability bill climbing faster than their infrastructure costs. The knee-jerk reaction is often to start sampling data: downsample metrics, head-sample traces, deduplicate logs. While these techniques have their place, they're fundamentally at odds with where we're heading.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image4.jpg" alt="Data Management: Reduce fidelity of data" /></p>
<p>Here's the thing: ML and AI systems thrive on rich, contextual data. When you sample away the &quot;noise,&quot; you're often discarding the very signals that could help you understand system behavior patterns or predict failures. Instead of asking &quot;how can we collect less data?&quot;, the better question is &quot;how can we store and process all this data cost-effectively?&quot;</p>
<p>Modern storage architectures, particularly those leveraging object storage and advanced compression techniques like ZStandard, can achieve remarkable cost-to-value ratios. The secret is organizing related data together and moving it to cheaper storage tiers quickly. This approach lets you have your cake and eat it too: full-fidelity data retention without breaking the bank.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image5.jpg" alt="Data Management: Make Storage Cheaper" /></p>
<p>Of course there is a balance to strike, and not all your applications are equal. As a first step, look at your most critical flows and applications and ensure that they have the richest telemetry. Do not take a sledgehammer approach and sample all your data just to reduce bills when a scalpel is best.</p>
<h2>OpenTelemetry (OTel): the foundation everything else builds on</h2>
<p>If I had to pick the single most transformative change in observability during my career, it would be OpenTelemetry. Not because it's flashy or revolutionary in concept, but because it solves fundamental problems that have plagued us for years.</p>
<p>Before OTel, instrumenting applications meant vendor lock-in. Want to switch from vendor A to vendor B? Good luck re-instrumenting your entire codebase. Want to send the same telemetry to multiple backends? Hope you enjoy maintaining multiple agent configurations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image6.jpg" alt="What is OpenTelemetry" /></p>
<p>OpenTelemetry changes things completely. Here are the three main reasons why.</p>
<p><strong>Vendor Neutrality:</strong> Your instrumentation code becomes portable. The same OTel SDK can send data to any compliant backend.</p>
<p><strong>OpenTelemetry Semantic Conventions:</strong> All your telemetry (logs, metrics, traces, profiles, wide-events) shares common metadata like service names, resource attributes, and trace context.</p>
<p><strong>Auto-Instrumentation:</strong> For most popular languages and frameworks, you get rich telemetry with zero code changes.</p>
<p>OTel also makes manual instrumentation incredibly valuable with minimal effort. Adding a single line like this to your authentication service:</p>
<p><code>baggage.set_baggage(&quot;customer.id&quot;, &quot;alice123&quot;)</code></p>
<p>That customer ID then flows through every downstream service call, every database query, every log message. Suddenly, you can search all your telemetry data by customer ID across your entire distributed system.</p>
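<p>To make the propagation idea concrete, here is a minimal sketch that uses only the Python standard library as a stand-in for baggage. It is not the OpenTelemetry API: the <code>customer_id</code> context variable, the <code>CustomerIdFilter</code> class and the function names are hypothetical. One detail worth knowing about the real Python API is that <code>baggage.set_baggage</code> returns a new context instead of changing the current one, so the value only propagates once that context is attached.</p>
<pre><code class="language-python"># Stdlib stand-in for baggage propagation; this is not the OpenTelemetry API.
import contextvars
import logging

# Hypothetical context variable playing the role of the 'customer.id' entry
customer_id = contextvars.ContextVar('customer_id', default='unknown')


class CustomerIdFilter(logging.Filter):
    # Copy the current customer ID onto every log record
    def filter(self, record):
        record.customer_id = customer_id.get()
        return True


def build_logger(name):
    logger = logging.getLogger(name)
    logger.addFilter(CustomerIdFilter())
    return logger


def charge_card(logger):
    # A downstream call: it never receives the customer ID as an argument
    logger.warning('card declined')


def handle_login(logger, user):
    # Set once at the edge; everything called from here sees the value
    token = customer_id.set(user)
    try:
        charge_card(logger)
    finally:
        # Restore the previous value so requests do not leak into each other
        customer_id.reset(token)
</code></pre>
<p>Calling <code>handle_login(logger, 'alice123')</code> stamps the record emitted inside <code>charge_card</code> with the customer ID, even though that function knows nothing about customers.</p>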
<p>The trajectory is clear: within a few years, OTel will be as ubiquitous and invisible as Kubernetes is becoming today. Runtimes will include it by default, cloud providers will offer OTel collectors at the edge, and frameworks will come pre-instrumented.</p>
<h2>Correlation: the secret sauce that makes everything click</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image7.jpg" alt="Why do we need correlation?" /></p>
<p>You get an alert about high latency. You check your metrics dashboard: yep, the 95th percentile is spiking. You switch to your tracing system and you can see some slow requests. You hop over to your logging system and there are some error messages around the same time. Now comes the fun part: figuring out which logs correspond to which traces and whether they're related to the metric that alerted you.</p>
<p>This context-switching nightmare is exactly what proper correlation eliminates. When your telemetry data shares common identifiers (for example, trace IDs in logs, consistent service names, synchronized timestamps, or even customer IDs), you can seamlessly pivot between different signal types without losing context.</p>
<p>But correlation goes beyond just technical convenience. When you can search all your logs by customer.id and immediately see the traces and metrics for that customer's journey through your system, you transform how you approach support and debugging. When you can filter your entire observability stack by deployment version and instantly understand the impact of a release, you change how you think about deployments.</p>
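<p>As a rough illustration of what a shared identifier buys you, the sketch below groups log lines and spans by trace ID and then picks out the slowest trace. The records, field names and helper functions are all hypothetical; a real backend would do this with a query, not in application code.</p>
<pre><code class="language-python">from collections import defaultdict

# Hypothetical records from two separate backends; field names are illustrative
logs = [
    {'trace_id': 'a1', 'service.name': 'checkout', 'message': 'db timeout'},
    {'trace_id': 'b2', 'service.name': 'checkout', 'message': 'order placed'},
]
spans = [
    {'trace_id': 'a1', 'name': 'POST /pay', 'duration_ms': 4200},
    {'trace_id': 'b2', 'name': 'POST /pay', 'duration_ms': 120},
]


def correlate(logs, spans):
    # Group log lines and spans that share a trace ID
    by_trace = defaultdict(lambda: {'logs': [], 'spans': []})
    for log in logs:
        by_trace[log['trace_id']]['logs'].append(log)
    for span in spans:
        by_trace[span['trace_id']]['spans'].append(span)
    return dict(by_trace)


def total_duration(group):
    # Sum the duration of every span in one correlated group
    return sum(span['duration_ms'] for span in group['spans'])


def slowest_trace(by_trace):
    # Return the trace ID with the longest total span duration
    return max(by_trace, key=lambda trace_id: total_duration(by_trace[trace_id]))
</code></pre>
<p>With the sample data, <code>slowest_trace(correlate(logs, spans))</code> points at trace <code>a1</code>, and the log lines for that trace are available in the same group without any timestamp matching.</p>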
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image8.jpg" alt="How does this work?" /></p>
<p>Metrics? Yes, even metrics can be correlated by using OpenTelemetry exemplars. For example, in Python you would turn on exemplars as follows.</p>
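<pre><code class="language-python"># Setup metrics with exemplars enabled
# (assumes an opentelemetry-sdk release that ships exemplar support)
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider, TraceBasedExemplarFilter

# Attach exemplars only to measurements recorded inside a sampled trace
exemplar_filter = TraceBasedExemplarFilter()

provider = MeterProvider(exemplar_filter=exemplar_filter)
metrics.set_meter_provider(provider)
</code></pre>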
<p>This attaches the trace and span IDs of the request that was in flight when a measurement was recorded, so a spike on a metrics chart can link straight to example traces.</p>
<h2>Then again, why correlate at all?</h2>
<p>So you may be thinking: this is great, and I can see this being a useful strategy. It is especially useful when you have metrics, logs and traces in separate systems. Pretty soon, however, you realize that it's a lot of effort when you could just combine all this data in a single data structure and avoid the need to correlate at all. The observability industry agrees and has recently been espousing the benefits of a new signal type called wide-events.</p>
<p>Wide-events are really just structured logs. The idea is to put metric data, trace data and log data all into the same wide data structure, which can make analysis much easier. Think about it: with a single data structure you can run queries and aggregations very quickly, without having to join any data, which can get pretty expensive.</p>
<p>Additionally, you are increasing the information density per log record, which is particularly great for AI applications. AI gets a context-rich dataset to analyze with minimal latency: a single record with enough descriptive capability to quickly find the root cause of your issue, without having to dig around in other data stores and figure out whatever schema those data stores are using.</p>
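<p>A wide-event can be pictured as one flat record per request. The sketch below is only an illustration: <code>build_wide_event</code>, <code>errors_by_field</code> and every field value are hypothetical, and a real implementation would follow the OTel semantic conventions for naming.</p>
<pre><code class="language-python">from collections import Counter


def build_wide_event(log, span, metrics):
    # Merge log, trace, and metric data for one request into a single record
    event = {}
    event.update(metrics)
    event.update(span)
    event.update(log)  # log fields win if a key appears more than once
    return event


# Hypothetical per-request data; all field names and values are illustrative
event = build_wide_event(
    log={'message': 'connection refused', 'status.code': 'ERROR'},
    span={'trace_id': 'a1', 'duration_ms': 4200},
    metrics={'db.pool.in_use': 50},
)


def errors_by_field(events, field):
    # Count failed events per value of one field, with no join required
    return Counter(e.get(field) for e in events if e.get('status.code') == 'ERROR')
</code></pre>
<p>Because everything lives in one record, an aggregation such as <code>errors_by_field(events, 'trace_id')</code> is a single pass over the data.</p>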
<p>LLMs especially LOVE context and if you can give them all the context they need without having them try to find it, your investigation time will significantly reduce.</p>
<p>This isn't just about making SRE life easier (though it does that). It's about creating the rich, interconnected dataset that AI and ML systems need to understand your infrastructure's behavior patterns.</p>
<h2>AI-driven investigations</h2>
<p>Observability tools today have been pretty good at solving the alerting fatigue and dashboarding problems; things have gotten quite mature there. Alert correlation and other techniques drastically reduce the noise in these domains, not to mention a focus on being alerted by SLOs instead of pure technical metrics. Life has gotten better over the past few years for SREs here.</p>
<p>Alerts are one piece of the puzzle, but the latest AI techniques using LLMs and agentic AI can unlock time savings in a different spot: during investigations. Think about it: investigations are typically what drag on when you have an outage, and the cognitive overload while the pressure is on is very real and pretty stressful for SREs.</p>
<p>The good news is that when we get our data in good shape with correlation, enrichment and wide-events, and we store it in full fidelity, we have the tools to help us drive faster investigations.</p>
<p>LLMs can take all that rich data and do some very powerful analysis that can cut down your investigation time. Let's walk through an example.</p>
<p>Imagine we have the following basic log. We only have a limited amount of data for an LLM to reason about. All it can tell is that a database failed.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image9.jpg" alt="What is a basic log" /></p>
<p>Let's see what this looks like when we use a wide-event. Already we can see some significant benefits. First, we only had to visit the log from a single node, the node that serviced the request. We didn’t have to dig into downstream logs. This already makes life easier for the LLM: it doesn't have to figure out how to correlate multiple log lines, traces and metrics, though we do still have correlation IDs if we need to look in downstream systems.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image10.jpg" alt="App Log" /></p>
<p>Next we have all this additional rich data that an LLM can use to reason about what happened. LLMs work best with context and if you can feed them as much context as possible they will work more effectively to reduce your investigation time.</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>How an LLM uses it</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>trace_id</code>, <code>parent_span_id</code></td>
<td>Thread every hop together without parsing free-text</td>
</tr>
<tr>
<td><code>status.code</code>, <code>error.*</code></td>
<td>Precise failure class; no NLP guess-work</td>
</tr>
<tr>
<td><code>db.*</code></td>
<td>Root-cause surface (&quot;postgres isn't provisioned&quot;)</td>
</tr>
<tr>
<td><code>user.id</code>, <code>cloud.region</code></td>
<td>Instant blast-radius queries</td>
</tr>
<tr>
<td><code>deployment.version</code></td>
<td>Correlation with new releases</td>
</tr>
</tbody>
</table>
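<p>The fields in the table are also directly queryable. As a minimal sketch, assuming wide-events shaped like the sample records below (the values and the <code>blast_radius</code> helper are hypothetical), a blast-radius question becomes a short filter and group:</p>
<pre><code class="language-python"># Hypothetical wide-events; field names follow the table above
events = [
    {'user.id': 'u1', 'cloud.region': 'eu-west-1', 'deployment.version': '2.4.0', 'status.code': 'ERROR'},
    {'user.id': 'u2', 'cloud.region': 'eu-west-1', 'deployment.version': '2.4.0', 'status.code': 'ERROR'},
    {'user.id': 'u3', 'cloud.region': 'us-east-1', 'deployment.version': '2.4.0', 'status.code': 'OK'},
    {'user.id': 'u4', 'cloud.region': 'us-east-1', 'deployment.version': '2.3.9', 'status.code': 'ERROR'},
]


def blast_radius(events, version):
    # Collect the affected users per region for failed requests on one release
    affected = {}
    for event in events:
        failed = event['status.code'] == 'ERROR'
        if failed and event['deployment.version'] == version:
            affected.setdefault(event['cloud.region'], set()).add(event['user.id'])
    return affected
</code></pre>
<p>For the sample data, <code>blast_radius(events, '2.4.0')</code> reports two affected users, both in <code>eu-west-1</code>.</p>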
<p>Notice that we didn’t get rid of the unstructured error message, this is still useful context! LLMs are great at processing unstructured text so this textual description helps it understand the problem even further.</p>
<p>Large language models shine when they’re handed complete, context-rich evidence, exactly what wide-event logging supplies. Invest once in richer logs, and every downstream AI workflow (summaries, anomaly detection, natural-language queries) becomes simpler, cheaper, and far more reliable.</p>
<h2>Building toward the future</h2>
<p>As I look ahead, three trends seem inevitable:</p>
<ol>
<li>
<p><strong>OpenTelemetry semantic conventions power wide-events:</strong> Using OTel semantic conventions to create wide-events will become as standard as logging is today. Cloud providers, runtimes, and frameworks will use them by default.</p>
</li>
<li>
<p><strong>Making sense of logs with LLMs:</strong> Both improving the richness of your data and having LLMs automatically improve the richness of your existing logs will become essential for shortening investigation times.</p>
</li>
<li>
<p><strong>AI will be essential</strong>: As system complexity outpaces human cognitive ability to understand it, AI assistance will become necessary for maintaining reasonable investigation times.</p>
</li>
</ol>
<p>The organizations that start building toward this future now, adopting OpenTelemetry, investing in richer observability, and beginning to experiment with AI-assisted debugging, will have a significant advantage as these trends accelerate.</p>
<h2>Your next steps</h2>
<p>If you're dealing with the observability gap in your own environment, here's where I'd start:</p>
<ol>
<li>
<p><strong>Evaluate your logs:</strong> Do your logs have the richness of data you need to shorten investigation times? Can LLMs help provide additional context?</p>
</li>
<li>
<p><strong>Start experimenting with OpenTelemetry:</strong> Even if you can't migrate everything immediately, instrumenting new services with OTel and using semantic conventions to produce wide-events gives you experience with the technology and starts building your enriched dataset.</p>
</li>
<li>
<p><strong>Add high-value context:</strong> Customer IDs, session IDs, deployment versions: even small amounts of contextual metadata can dramatically improve your debugging capabilities.</p>
</li>
<li>
<p><strong>Think beyond storage costs:</strong> Instead of sampling data away, investigate modern storage architectures that let you keep everything at a reasonable cost for your most critical services.</p>
</li>
</ol>
<p>The complexity rocket ship has left the station, and it's not slowing down. The question isn't whether your observability strategy needs to evolve; it's whether you'll evolve it proactively or reactively. I know which approach leads to better sleep at night.</p>
<h2>Additional resources</h2>
<ul>
<li><a href="https://www.elastic.co/virtual-events/getting-started-logging">Getting started with logging on the ELK Stack webinar</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/the-next-evolution-of-observability-unifying-data-with-opentelemetry-and-generative-ai">The next evolution of observability: unifying data with OpenTelemetry and generative AI blog</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-agent-pivot-opentelemetry">Pivoting Elastic's Data Ingestion to OpenTelemetry blog</a></li>
</ul>]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/modern-observability-opentelemetry-correlation-ai/ObservabilityGapBlog-Image1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitor dbt pipelines with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability</link>
            <guid isPermaLink="false">monitor-dbt-pipelines-with-elastic-observability</guid>
            <pubDate>Fri, 26 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to set up a dbt monitoring system with Elastic that proactively alerts on data processing cost spikes, anomalies in rows per table, and data quality test failures]]></description>
            <content:encoded><![CDATA[<p>In the Data Analytics team within the Observability organization in Elastic, we use <a href="https://www.getdbt.com/product/what-is-dbt">dbt (dbt™, data build tool)</a> to execute our SQL data transformation pipelines. dbt is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code. In particular, we use <a href="https://docs.getdbt.com/docs/core/installation-overview">dbt core</a>, the <a href="https://github.com/dbt-labs/dbt-core">open-source project</a>, where you can develop from the command line and run your dbt project.</p>
<p>Our data transformation pipelines run daily and process the data that feed our internal dashboards, reports, analyses, and Machine Learning (ML) models.</p>
<p>There have been incidents in the past when the pipelines have failed, the source tables contained wrong data or we have introduced a change into our SQL code that has caused data quality issues, and we only realized once we saw it in a weekly report that was showing an anomalous number of records. That’s why we have built a monitoring system that proactively alerts us about these types of incidents as soon as they happen and helps us with visualizations and analyses to understand their root cause, saving us several hours or days of manual investigations.</p>
<p>We have leveraged our own Observability Solution to help solve this challenge, monitoring the entire lifecycle of our dbt implementation. This setup enables us to track the behavior of our models and conduct data quality testing on the final tables. We export dbt process logs from run jobs and tests into Elasticsearch and utilize Kibana to create dashboards, set up alerts, and configure Machine Learning jobs to monitor and assess issues.</p>
<p>The following diagram shows our complete architecture. In a follow-up article, we’ll also cover how we observe our Python data processing and ML model processes using OTel and Elastic - stay tuned.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/architecture.png" alt="1 - architecture" /></p>
<h2>Why monitor dbt pipelines with Elastic?</h2>
<p>With every invocation, dbt generates and saves one or more JSON files called <a href="https://docs.getdbt.com/reference/artifacts/dbt-artifacts">artifacts</a> containing log data on the invocation results. <code>dbt run</code> and <code>dbt test</code> invocation logs are <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">stored in the file <code>run_results.json</code></a>, as per the dbt documentation:</p>
<blockquote>
<p>This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed. In aggregate, many <code>run_results.json</code> can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.</p>
</blockquote>
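<p>As a rough sketch of that kind of aggregation, the helpers below combine several already-parsed <code>run_results.json</code> payloads. The function names are hypothetical, and the code assumes each result carries <code>unique_id</code>, <code>execution_time</code> and <code>status</code>, with failed tests reported as <code>fail</code>.</p>
<pre><code class="language-python">from collections import defaultdict


def average_runtime(run_results_list):
    # Average execution_time per unique_id across several payloads
    totals = defaultdict(float)
    counts = defaultdict(int)
    for run_results in run_results_list:
        for result in run_results['results']:
            totals[result['unique_id']] += result['execution_time']
            counts[result['unique_id']] += 1
    return {uid: totals[uid] / counts[uid] for uid in totals}


def test_failure_rate(run_results_list):
    # Share of test nodes whose status is 'fail' across all payloads
    statuses = [
        result['status']
        for run_results in run_results_list
        for result in run_results['results']
        if result['unique_id'].startswith('test.')
    ]
    if not statuses:
        return 0.0
    return statuses.count('fail') / len(statuses)
</code></pre>
<p>Each payload is what <code>json.load</code> returns for one <code>run_results.json</code> file, so a week of daily runs is simply a list of seven dictionaries.</p>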
<p>Monitoring <code>dbt run</code> invocation logs can help solve several issues, including tracking and alerting about table volumes, detecting excessive slot time from resource-intensive models, identifying cost spikes due to slot time or volume, and pinpointing slow execution times that may indicate scheduling issues. This system was crucial when we merged a PR with a change in our code that had an issue, producing a sudden drop in the number of daily rows in upstream Table A. By ingesting the <code>dbt run</code> logs into Elastic, our anomaly detection job quickly identified anomalies in the daily row counts for Table A and its downstream tables, B, C, and D. The Data Analytics team received an alert notification about the issue, allowing us to promptly troubleshoot, fix and backfill the tables before it affected the weekly dashboards and downstream ML models.</p>
<p>Monitoring <code>dbt test</code> invocation logs can also address several issues, such as identifying duplicates in tables, detecting unnoticed alterations in allowed values for specific fields through validation of all enum fields, and resolving various other data processing and quality concerns. With dashboards and alerts on data quality tests, we proactively identify issues like duplicate keys, unexpected category values, and increased nulls, ensuring data integrity. In our team, we had an issue where a change in one of our raw lookup tables produced duplicated rows in our user table, doubling the number of users reported. By ingesting the <code>dbt test</code> logs into Elastic, our rules detected that some duplicate tests had failed. The team received an alert notification about the issue, allowing us to troubleshoot it right away by finding the upstream table that was the root cause. These duplicates meant that downstream tables had to process 2x the amount of data, creating a spike in the bytes processed and slot time. The anomaly detection and alerts on the <code>dbt run</code> logs also helped us spot these spikes for individual tables and allowed us to quantify the impact on our billing.</p>
<p>Processing our dbt logs with Elastic and Kibana allows us to obtain real-time insights, helps us quickly troubleshoot potential issues, and keeps our data transformation processes running smoothly. We set up anomaly detection jobs and alerts in Kibana to monitor the number of rows processed by dbt, the slot time, and the results of the tests. This lets us catch real-time incidents, and by promptly identifying and fixing these issues, Elastic makes our data pipeline more resilient and our models more cost-effective, helping us stay on top of cost spikes or data quality issues.</p>
<p>We can also correlate this information with other events ingested into Elastic. For example, using the <a href="https://www.elastic.co/guide/en/enterprise-search/current/connectors-github.html">Elastic GitHub connector</a>, we can correlate data quality test failures or other anomalies with code changes to find the commit or PR that caused the issues. By ingesting application logs into Elastic, we can also use APM to analyze whether these pipeline issues have affected downstream applications by changing latency, throughput or error rates. By ingesting billing, revenue or web traffic data, we could also see the impact on business metrics.</p>
<h2>How to export dbt invocation logs to Elasticsearch</h2>
<p>We use the <a href="https://elasticsearch-py.readthedocs.io/en">Python Elasticsearch client</a> to send the dbt invocation logs to Elastic after we run our <code>dbt run</code> and <code>dbt test</code> processes daily in production. The setup just requires you to install the <a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#installation">Elasticsearch Python client</a> and obtain your Elastic Cloud ID (go to <a href="https://cloud.elastic.co/deployments/">https://cloud.elastic.co/deployments/</a>, select your deployment and find the <code>Cloud ID</code>) and Elastic Cloud API Key <a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#connecting">(following this guide)</a>.</p>
<p>This Python helper function will index the results from your <code>run_results.json</code> file to the specified index. You just need to export the variables to the environment:</p>
<ul>
<li><code>RESULTS_FILE</code>: path to your <code>run_results.json</code> file</li>
<li><code>DBT_RUN_LOGS_INDEX</code>: the name you want to give to dbt run logs index in Elastic, e.g. <code>dbt_run_logs</code></li>
<li><code>DBT_TEST_LOGS_INDEX</code>: the name you want to give to the dbt test logs index in Elastic, e.g. <code>dbt_test_logs</code></li>
<li><code>ES_CLUSTER_CLOUD_ID</code></li>
<li><code>ES_CLUSTER_API_KEY</code></li>
</ul>
<p>Then call the function <code>log_dbt_es</code> from your python code or save this code as a python script and run it after executing your <code>dbt run</code> or <code>dbt test</code> commands:</p>
<pre><code>from elasticsearch import Elasticsearch, helpers
import os
import sys
import json


def log_dbt_es():
    # Configuration comes from environment variables
    RESULTS_FILE = os.environ[&quot;RESULTS_FILE&quot;]
    DBT_RUN_LOGS_INDEX = os.environ[&quot;DBT_RUN_LOGS_INDEX&quot;]
    DBT_TEST_LOGS_INDEX = os.environ[&quot;DBT_TEST_LOGS_INDEX&quot;]
    es_cluster_cloud_id = os.environ[&quot;ES_CLUSTER_CLOUD_ID&quot;]
    es_cluster_api_key = os.environ[&quot;ES_CLUSTER_API_KEY&quot;]

    # Fail fast if dbt did not produce a results file
    if not os.path.exists(RESULTS_FILE):
        print(f&quot;ERROR: no dbt run results found at {RESULTS_FILE}.&quot;)
        sys.exit(1)

    es_client = Elasticsearch(
        cloud_id=es_cluster_cloud_id,
        api_key=es_cluster_api_key,
        request_timeout=120,
    )

    with open(RESULTS_FILE, &quot;r&quot;) as json_file:
        results = json.load(json_file)

    # Invocation-level fields that are copied onto every document
    timestamp = results[&quot;metadata&quot;][&quot;generated_at&quot;]
    metadata = results[&quot;metadata&quot;]
    elapsed_time = results[&quot;elapsed_time&quot;]
    args = results[&quot;args&quot;]

    docs = []
    for result in results[&quot;results&quot;]:
        # unique_id starts with the node type, e.g. &quot;test&quot; or &quot;model&quot;
        if result[&quot;unique_id&quot;].split(&quot;.&quot;)[0] == &quot;test&quot;:
            result[&quot;_index&quot;] = DBT_TEST_LOGS_INDEX
        else:
            result[&quot;_index&quot;] = DBT_RUN_LOGS_INDEX
        result[&quot;@timestamp&quot;] = timestamp
        result[&quot;metadata&quot;] = metadata
        result[&quot;elapsed_time&quot;] = elapsed_time
        result[&quot;args&quot;] = args
        docs.append(result)

    # Index all documents with the bulk helper
    helpers.bulk(es_client, docs)
    return &quot;Done&quot;


# Call the function when the file is run as a script
if __name__ == &quot;__main__&quot;:
    log_dbt_es()
</code></pre>
<p>If you want to add/remove any other fields from <code>run_results.json</code>, you can modify the above function to do it.</p>
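<p>One way to do that, shown here only as a sketch, is a small helper that trims each result before it is appended to <code>docs</code>. The name <code>select_fields</code> is hypothetical, and the fields you keep or drop depend on what your dbt version writes to <code>run_results.json</code>.</p>
<pre><code class="language-python">def select_fields(result, keep=None, drop=()):
    # Return a copy of one result with fields kept or dropped
    if keep is not None:
        doc = {key: value for key, value in result.items() if key in keep}
    else:
        doc = dict(result)
    for field in drop:
        # Ignore fields that are not present in this result
        doc.pop(field, None)
    return doc
</code></pre>
<p>For example, <code>select_fields(result, drop=('timing', 'thread_id'))</code> removes two fields you may not need. Apply it before <code>_index</code> and <code>@timestamp</code> are set, or include those names in <code>keep</code>, so the routing fields are not filtered out.</p>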
<p>Once the results are indexed, you can use Kibana to create Data Views for both indexes and start exploring them in Discover.</p>
<p>Go to Discover, click on the data view selector on the top left and “Create a data view”.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-create-dataview.png" alt="2 - discover create a data view" /></p>
<p>Now you can create a data view with your preferred name. Do this for both dbt run (<code>DBT_RUN_LOGS_INDEX</code> in your code) and dbt test (<code>DBT_TEST_LOGS_INDEX</code> in your code) indices:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/create-dataview.png" alt="3 - create a data view" /></p>
<p>Going back to Discover, you’ll be able to select the Data Views and explore the data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-logs-explorer.png" alt="4 - discover logs explorer" /></p>
<h2>dbt run alerts, dashboards and ML jobs</h2>
<p>The invocation of <a href="https://docs.getdbt.com/reference/commands/run"><code>dbt run</code></a> executes compiled SQL model files against the current database. <code>dbt run</code> invocation logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique model identifier</li>
<li><code>execution_time</code>: Total time spent executing this model run</li>
</ul>
<p>The logs also contain the following metrics about the job execution from the adapter:</p>
<ul>
<li><code>adapter_response.bytes_processed</code></li>
<li><code>adapter_response.bytes_billed</code></li>
<li><code>adapter_response.slot_ms</code></li>
<li><code>adapter_response.rows_affected</code></li>
</ul>
<p>We have used Kibana to set up <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">Anomaly Detection jobs</a> on the above-mentioned metrics. You can configure a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">multi-metric job</a> split by <code>unique_id</code> to be alerted when the sum of rows affected, slot time consumed, or bytes billed is anomalous per table. You can track one job per metric. If you have built a dashboard of the metrics per table, you can use <a href="https://www.elastic.co/guide/en/machine-learning/8.14/ml-jobs-from-lens.html">this shortcut</a> to create the Anomaly Detection job directly from the visualization. After the jobs are created and are running on incoming data, you can <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-view-results.html">view the jobs</a> and add them to a dashboard using the three dots button in the anomaly timeline:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-add-to-dashboard.png" alt="5 - add ML job to dashboard" /></p>
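<p>The multi-metric job described above can also be written down as a job configuration. The dictionary below is a sketch in the shape used by the Elasticsearch anomaly detection job API; the helper name, the bucket span, and the choice of <code>sum</code> for every detector are assumptions to adapt to your data.</p>

```python
# Metrics from the dbt run logs that we want to model per table.
METRICS = [
    "adapter_response.rows_affected",
    "adapter_response.slot_ms",
    "adapter_response.bytes_billed",
]


def multi_metric_job_config(metrics, split_field="unique_id", bucket_span="1d"):
    """Return an anomaly detection job body with one sum detector per metric."""
    detectors = [
        {
            "function": "sum",
            "field_name": metric,
            # Model each table (unique_id) separately.
            "partition_field_name": split_field,
        }
        for metric in metrics
    ]
    return {
        "analysis_config": {
            "bucket_span": bucket_span,  # assumed value; match your dbt schedule
            "detectors": detectors,
            "influencers": [split_field],
        },
        "data_description": {"time_field": "@timestamp"},
    }


job_config = multi_metric_job_config(METRICS)
```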
<p>We have used the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-alerts.html">ML job to set up alerts</a> that send us email or Slack messages when anomalies are detected. Alerts can be created directly from the Jobs (Machine Learning &gt; Anomaly Detection Jobs) page, by clicking on the three dots at the end of the ML job row:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-create-alert.png" alt="6 - create alert from ML job" /></p>
<p>We also use <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">Kibana dashboards</a> to visualize the anomaly detection job results and related metrics per table, to identify which tables consume most of our resources, to have visibility on their temporal evolution, and to measure aggregated metrics that can help us understand month over month changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-dashboard.png" alt="7 - ML job in dashboard" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-slot-time.png" alt="8 - dashboard slot time chart" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-aggregated-metrics.png" alt="9 - dashboard aggregated metrics" /></p>
<h2>dbt test alerts and dashboards</h2>
<p>You may already be familiar with <a href="https://docs.getdbt.com/docs/build/data-tests">tests in dbt</a>, but if you’re not, dbt data tests are assertions you make about your models. Using the command <a href="https://docs.getdbt.com/reference/commands/test"><code>dbt test</code></a>, dbt will tell you if each test in your project passes or fails. <a href="https://docs.getdbt.com/docs/build/data-tests#example">Here is an example of how to set them up</a>. In our team, we use out-of-the-box dbt tests (<code>unique</code>, <code>not_null</code>, <code>accepted_values</code>, and <code>relationships</code>) and the packages <a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/">dbt_utils</a> and <a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/">dbt_expectations</a> for some extra tests. When the command <code>dbt test</code> is run, it generates logs that are stored in <code>run_results.json</code>.</p>
<p>dbt test logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique test identifier, tests contain the “test” prefix in their unique identifier</li>
<li><code>status</code>: result of the test, <code>pass</code> or <code>fail</code></li>
<li><code>execution_time</code>: Total time spent executing this test</li>
<li><code>failures</code>: number of failing records returned by the test; 0 if the test passes</li>
<li><code>message</code>: If the test fails, reason why it failed</li>
</ul>
<p>The logs also contain the metrics about the job execution from the adapter.</p>
<p>We have set up alerts on document count (see <a href="https://www.elastic.co/guide/en/observability/8.14/custom-threshold-alert.html">guide</a>) that send us an email or Slack message whenever any test fails. The rule is defined on the dbt test Data View created earlier, with a query filtering on <code>status:fail</code> to select the logs of failed tests and a rule condition of document count greater than 0.
Whenever a test fails in production, we get an alert with links to the alert details and dashboards so that we can troubleshoot it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/email-alert.png" alt="10 - alert" /></p>
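<p>The rule condition above amounts to counting failed-test documents in a time window. The sketch below builds an equivalent Elasticsearch query body and applies the same condition to a list of documents locally. The one-hour window, the function names, and the assumption that <code>status</code> is mapped as a keyword field are ours.</p>

```python
def failed_tests_query(window="now-1h"):
    """Query body matching dbt test logs with status 'fail' in the given window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Assumes status is indexed as a keyword field.
                    {"term": {"status": "fail"}},
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        }
    }


def should_alert(docs):
    """Mirror the rule condition: alert when the count of failed tests is above 0."""
    failed = [doc for doc in docs if doc.get("status") == "fail"]
    return len(failed) > 0, [doc["unique_id"] for doc in failed]


# Made-up documents for illustration.
alert, failed_ids = should_alert([
    {"unique_id": "test.proj.unique_orders_id", "status": "pass"},
    {"unique_id": "test.proj.not_null_orders_id", "status": "fail"},
])
```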
<p>We have also built a dashboard to visualize the tests run, tests failed, and their execution time and slot time to have a historical view of the test run:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-tests.png" alt="11 - dashboard dbt tests" /></p>
<h2>Finding Root Causes with the AI Assistant</h2>
<p>The most effective way for us to analyze these multiple sources of information is using the AI Assistant to help us troubleshoot the incidents. In our case, we got an alert about a test failure, and we used the AI Assistant to give us context on what happened. Then we asked if there were any downstream consequences, and the AI Assistant interpreted the results of the Anomaly Detection job, which indicated a spike in slot time for one of our downstream tables and the increase of the slot time vs. the baseline. Then, we asked for the root cause, and the AI Assistant was able to find and provide us a link to a PR from our GitHub changelog that matched the start of the incident and was the most probable cause.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ai-assistant.png" alt="12 - ai assistant troubleshoot" /></p>
<h2>Conclusion</h2>
<p>As a Data Analytics team, we are responsible for guaranteeing that the tables, charts, models, reports, and dashboards we provide to stakeholders are accurate and contain the right sources of information. As teams grow, the number of models we own becomes larger and more interconnected, and it isn’t easy to guarantee that everything is running smoothly and providing accurate results. Having a monitoring system that proactively alerts us on cost spikes, anomalies in row counts, or data quality test failures is like having a trusted companion that will alert you in advance if something goes wrong and help you get to the root cause of the issue.</p>
<p>dbt invocation logs are a crucial source of information about the status of our data pipelines, and Elastic is the perfect tool to extract the maximum potential out of them. Use this blog post as a starting point for utilizing your dbt logs to help your team achieve greater reliability and peace of mind, allowing them to focus on more strategic tasks rather than worrying about potential data issues.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/monitoring-dbt-with-elastic.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitor your Python data pipelines with OTEL]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-your-python-data-pipelines-with-otel</link>
            <guid isPermaLink="false">monitor-your-python-data-pipelines-with-otel</guid>
            <pubDate>Thu, 08 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to configure OTEL for your data pipelines, detect any anomalies, analyze performance, and set up corresponding alerts with Elastic.]]></description>
            <content:encoded><![CDATA[<p>This article delves into how to implement observability practices, particularly using <a href="https://opentelemetry.io/">OpenTelemetry (OTEL)</a> in Python, to enhance the monitoring and quality control of data pipelines using Elastic. While the examples in this article focus on ETL (Extract, Transform, Load) processes, where accurate and reliable data pipelines are crucial for Business Intelligence (BI), the strategies and tools discussed are equally applicable to Python processes used for Machine Learning (ML) models or other data processing tasks.</p>
<h2>Introduction</h2>
<p>Data pipelines, particularly ETL processes, form the backbone of modern data architectures. These pipelines are responsible for extracting raw data from various sources, transforming it into meaningful information, and loading it into data warehouses or data lakes for analysis and reporting.</p>
<p>In our organization, we have Python-based ETL scripts that play a pivotal role in exporting and processing data from Elasticsearch (ES) clusters and loading it into <a href="https://cloud.google.com/bigquery">Google BigQuery (BQ)</a>. This processed data then feeds into <a href="https://www.getdbt.com">DBT (Data Build Tool)</a> models, which further refine the data and make it available for analytics and reporting. To see the full architecture and learn how we monitor our DBT pipelines with Elastic, see <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>. In this article, we focus on the ETL scripts. Given the critical nature of these scripts, it is imperative to set up mechanisms to control and ensure the quality of the data they generate.</p>
<p>The strategies discussed here can be extended to any script or application that handles data processing or machine learning models, regardless of the programming language used as long as there exists a corresponding agent that supports OTEL instrumentation.</p>
<h2>Motivation</h2>
<p>Observability in data pipelines involves monitoring the entire lifecycle of data processing to ensure that everything works as expected. It includes:</p>
<ol>
<li>Data Quality Control:</li>
</ol>
<ul>
<li>Detecting anomalies in the data, such as unexpected drops in record counts.</li>
<li>Verifying that data transformations are applied correctly and consistently.</li>
<li>Ensuring the integrity and accuracy of the data loaded into the data warehouse.</li>
</ul>
<ol start="2">
<li>Performance Monitoring:</li>
</ol>
<ul>
<li>Tracking the execution time of ETL scripts to identify bottlenecks and optimize performance.</li>
<li>Monitoring resource usage, such as memory and CPU consumption, to ensure efficient use of infrastructure.</li>
</ul>
<ol start="3">
<li>Real-time Alerting:</li>
</ol>
<ul>
<li>Setting up alerts for immediate notification of issues such as failed ETL jobs, data quality issues, or performance degradation.</li>
<li>Identifying the root cause of such incidents.</li>
<li>Proactively addressing incidents to minimize downtime and impact on business operations.</li>
</ul>
<p>Issues such as failed ETL jobs can even point to larger infrastructure problems or to quality issues in the data source.</p>
<h2>Steps for Instrumentation</h2>
<p>Here are the steps to automatically instrument your Python script for exporting OTEL traces, metrics, and logs.</p>
<h3>Step 1: Install Required Libraries</h3>
<p>We first need to install the following libraries.</p>
<pre><code class="language-sh">pip install elastic-opentelemetry google-cloud-bigquery[opentelemetry]
</code></pre>
<p>You can also add them to your project's <code>requirements.txt</code> file and install them with <code>pip install -r requirements.txt</code>.</p>
<h4>Explanation of Dependencies</h4>
<ol>
<li>
<p><strong>elastic-opentelemetry</strong>: This package is the Elastic Distribution for OpenTelemetry Python. Under the hood it will install the following packages:</p>
<ul>
<li>
<p><strong>opentelemetry-distro</strong>: This package is a convenience distribution of OpenTelemetry, which includes the OpenTelemetry SDK, APIs, and various instrumentation packages. It simplifies the setup and configuration of OpenTelemetry in your application.</p>
</li>
<li>
<p><strong>opentelemetry-exporter-otlp</strong>: This package provides an exporter that sends telemetry data to the OpenTelemetry Collector or any other endpoint that supports the OpenTelemetry Protocol (OTLP). This includes traces, metrics, and logs.</p>
</li>
<li>
<p><strong>opentelemetry-instrumentation-system-metrics</strong>: This package provides instrumentation for collecting system metrics, such as CPU usage, memory usage, and other system-level metrics.</p>
</li>
</ul>
</li>
<li>
<p><strong>google-cloud-bigquery[opentelemetry]</strong>: This package integrates Google Cloud BigQuery with OpenTelemetry, allowing you to trace and monitor BigQuery operations.</p>
</li>
</ol>
<h3>Step 2: Export OTEL Variables</h3>
<p>Set the necessary OpenTelemetry (OTEL) variables using the configuration provided in the Elastic APM interface.</p>
<p>Go to APM -&gt; Services -&gt; Add data (top left corner).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-1.png" alt="1 - Get OTEL variables step 1" /></p>
<p>In this section, you will find the steps for configuring various APM agents. Navigate to OpenTelemetry to find the variables that you need to export.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-2.png" alt="2 - Get OTEL variables step 2" /></p>
<p><strong>Find OTLP Endpoint</strong>:</p>
<ul>
<li>Look for the section related to OpenTelemetry or OTLP configuration.</li>
<li>The <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is typically provided as part of the setup instructions for integrating OpenTelemetry with Elastic APM. It might look something like <code>https://&lt;your-apm-server&gt;/otlp</code>.</li>
</ul>
<p><strong>Obtain OTLP Headers</strong>:</p>
<ul>
<li>In the same section, you should find instructions or a field for OTLP headers. These headers are often used for authentication purposes.</li>
<li>Copy the necessary headers provided by the interface. They might look like <code>Authorization: Bearer &lt;your-token&gt;</code>.</li>
</ul>
<p>Note: When using Python, you need to replace the whitespace between <code>Bearer</code> and your token with <code>%20</code> in the <code>OTEL_EXPORTER_OTLP_HEADERS</code> variable.</p>
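<p>If you prefer to generate the header value rather than edit it by hand, the standard library can do the percent-encoding. This is a small sketch; the helper name <code>otlp_auth_header</code> and the token value are placeholders, not part of the Elastic setup instructions.</p>

```python
from urllib.parse import quote


def otlp_auth_header(token):
    """Build the OTEL_EXPORTER_OTLP_HEADERS value with the space percent-encoded."""
    # quote() turns the space between "Bearer" and the token into %20.
    return "Authorization=" + quote(f"Bearer {token}")


header_value = otlp_auth_header("your-token")
print(header_value)  # Authorization=Bearer%20your-token
```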
<p>Alternatively you can use a different approach for authentication using API keys (see <a href="https://github.com/elastic/elastic-otel-python?tab=readme-ov-file#authentication">instructions</a>). If you are using our <a href="https://www.elastic.co/docs/current/serverless/general/what-is-serverless-elastic">serverless offering</a> you will need to use this approach instead.</p>
<p><strong>Set up the variables</strong>:</p>
<ul>
<li>Replace the placeholders in your script with the actual values obtained from the Elastic APM interface and execute it in your shell via the source command <code>source env.sh</code>.</li>
</ul>
<p>Below is a script to set these variables:</p>
<pre><code class="language-sh">#!/bin/bash
echo &quot;--- :otel: Setting OTEL variables&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT='https://your-apm-server:443'
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20your-token'
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_PYTHON_LOG_CORRELATION=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_LOGS_EXPORTER=&quot;otlp,console&quot;
</code></pre>
<p>With these variables set, we are ready for auto-instrumentation without needing to add anything to the code.</p>
<h4>Explanation of Variables</h4>
<ul>
<li>
<p><strong>OTEL_EXPORTER_OTLP_ENDPOINT</strong>: This variable specifies the endpoint to which OTLP data (traces, metrics, logs) will be sent. Replace <code>your-apm-server</code> in the script with your actual OTLP endpoint host.</p>
</li>
<li>
<p><strong>OTEL_EXPORTER_OTLP_HEADERS</strong>: This variable specifies any headers required for authentication or other purposes when sending OTLP data. Replace <code>your-token</code> in the script with your actual token.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED</strong>: This variable enables auto-instrumentation for logging in Python, allowing logs to be automatically enriched with trace context.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOG_CORRELATION</strong>: This variable enables log correlation, which includes trace context in log entries to correlate logs with traces.</p>
</li>
<li>
<p><strong>OTEL_METRIC_EXPORT_INTERVAL</strong>: This variable specifies the metric export interval in milliseconds, in this case 5s.</p>
</li>
<li>
<p><strong>OTEL_LOGS_EXPORTER</strong>: This variable specifies the exporter to use for logs. Setting it to &quot;otlp&quot; means that logs will be exported using the OTLP protocol. Adding &quot;console&quot; specifies that logs should be exported to both the OTLP endpoint and the console. In our case, we chose to export to the console as well, for better visibility on the infra side.</p>
</li>
<li>
<p><strong>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</strong>: This variable enables the collection of system metrics. It must be set explicitly when using the Elastic distribution, because it defaults to false.</p>
</li>
</ul>
<p>Note: <strong>OTEL_METRICS_EXPORTER</strong> and <strong>OTEL_TRACES_EXPORTER</strong>: These variables specify the exporters to use for metrics and traces. They are set to &quot;otlp&quot; by default, which means that metrics and traces will be exported using the OTLP protocol.</p>
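<p>Because auto-instrumentation reads its configuration only from the environment, it can help to check for unset variables before a run. The check below is a hypothetical pre-flight helper; which variables count as required is our assumption and depends on your authentication method.</p>

```python
import os

# Variables we assume must be present before running an ETL (adjust as needed).
REQUIRED_VARS = ("OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_HEADERS")


def missing_otel_vars(env=None):
    """Return the names of required OTEL variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://your-apm-server:443"}
print(missing_otel_vars(example_env))  # ['OTEL_EXPORTER_OTLP_HEADERS']
```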
<h3>Running Python ETLs</h3>
<p>We run Python ETLs with the following command:</p>
<pre><code class="language-sh">OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=x-ETL,service.version=1.0,deployment.environment=production&quot; opentelemetry-instrument python3 X_ETL.py
</code></pre>
<h4>Explanation of the Command</h4>
<ul>
<li>
<p><strong>OTEL_RESOURCE_ATTRIBUTES</strong>: This variable specifies additional resource attributes, such as <a href="https://www.elastic.co/guide/en/observability/current/apm.html">service name</a>, service version, and deployment environment, that will be included in all telemetry data. You can customize these values to fit your needs, and you can use a different service name for each script.</p>
</li>
<li>
<p><strong>opentelemetry-instrument</strong>: This command auto-instruments the specified Python script for OpenTelemetry. It sets up the necessary hooks to collect traces, metrics, and logs.</p>
</li>
<li>
<p><strong>python3 X_ETL.py</strong>: This runs the specified Python script (<code>X_ETL.py</code>).</p>
</li>
</ul>
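<p>When several scripts each need their own service name, the attribute string can be generated instead of typed. The helper below is an illustrative sketch; its name and default values are assumptions that mirror the command above.</p>

```python
def resource_attributes(service_name, version="1.0", environment="production"):
    """Format a value for OTEL_RESOURCE_ATTRIBUTES as comma-separated key=value pairs."""
    attrs = {
        "service.name": service_name,
        "service.version": version,
        "deployment.environment": environment,
    }
    return ",".join(f"{key}={value}" for key, value in attrs.items())


print(resource_attributes("x-ETL"))
# service.name=x-ETL,service.version=1.0,deployment.environment=production
```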
<h3>Tracing</h3>
<p>We export the traces via the default OTLP protocol.</p>
<p>Tracing is a key aspect of monitoring and understanding the performance of applications. <a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-spans.html">Spans</a> form the building blocks of tracing. They encapsulate detailed information about the execution of specific code paths. They record the start and end times of activities and can have hierarchical relationships with other spans, forming a parent/child structure.</p>
<p>Spans include essential attributes such as transaction IDs, parent IDs, start times, durations, names, types, subtypes, and actions. Additionally, spans may contain stack traces, which provide a detailed view of function calls, including attributes like function name, file path, and line number, which is especially useful for debugging. These attributes help us analyze the script's execution flow, identify performance issues, and enhance optimization efforts.</p>
<p>With the default instrumentation, the whole Python script would be a single span. In our case, we decided to manually add specific spans for the different phases of the Python process, so that we can measure their latency, throughput, error rate, etc. individually. This is how we define spans manually:</p>
<pre><code class="language-python">from opentelemetry import trace

if __name__ == &quot;__main__&quot;:
    # Named tracer used to create one span per ETL phase
    tracer = trace.get_tracer(&quot;main&quot;)

    with tracer.start_as_current_span(&quot;initialization&quot;) as span:
        # Init code
        ...
    with tracer.start_as_current_span(&quot;search&quot;) as span:
        # Step 1 - Search code
        ...
    with tracer.start_as_current_span(&quot;transform&quot;) as span:
        # Step 2 - Transform code
        ...
    with tracer.start_as_current_span(&quot;load&quot;) as span:
        # Step 3 - Load code
        ...
</code></pre>
<p>You can explore traces in the APM interface as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/Traces-APM-Observability-Elastic.png" alt="3 - APM Traces view" /></p>
<h3>Metrics</h3>
<p>We also export metrics, such as CPU usage and memory, via the default OTLP protocol. No extra code needs to be added in the script itself.</p>
<p>Note: Remember to set <code>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</code> to true.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-metrics-apm-view.png" alt="4 - APM Metrics view" /></p>
<h3>Logging</h3>
<p>We export logs via the default OTLP protocol as well.</p>
<p>For logging, we modify the logging calls to add extra fields using a dictionary structure (bq_fields) as shown below:</p>
<pre><code class="language-python">        job.result()  # Waits for table load to complete
        job_details = client.get_job(job.job_id)  # Get job details

        # Extract job information
        bq_fields = {
            # &quot;slot_time_ms&quot;: job_details.slot_ms,
            &quot;job_id&quot;: job_details.job_id,
            &quot;job_type&quot;: job_details.job_type,
            &quot;state&quot;: job_details.state,
            &quot;path&quot;: job_details.path,
            &quot;job_created&quot;: job_details.created.isoformat(),
            &quot;job_ended&quot;: job_details.ended.isoformat(),
            &quot;execution_time_ms&quot;: (
                job_details.ended - job_details.created
            ).total_seconds()
            * 1000,
            &quot;bytes_processed&quot;: job_details.output_bytes,
            &quot;rows_affected&quot;: job_details.output_rows,
            &quot;destination_table&quot;: job_details.destination.table_id,
            &quot;event&quot;: &quot;BigQuery Load Job&quot;, # Custom event type
            &quot;status&quot;: &quot;success&quot;, # Status of the step (success/error)
            &quot;category&quot;: category # ETL category tag 
        }

        logging.info(&quot;BigQuery load operation successful&quot;, extra=bq_fields)
</code></pre>
<p>This code shows how to extract BQ job stats such as execution time, bytes processed, rows affected, and destination table. You can also add other metadata, as we do, such as a custom event type, status, and category.</p>
<p>Any call to <code>logging</code> at or above the configured level (in this case INFO, set with <code>logging.getLogger().setLevel(logging.INFO)</code>) creates a log that is exported to Elastic. This means that Python scripts that already use <code>logging</code> need no changes to export logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-logs-apm-view.png" alt="5 - APM Logs view" /></p>
<p>For each of the log messages, you can go into the details view (click on the <code>…</code> when you hover over the log line and go into <code>View details</code>) to examine the metadata attached to the log message. You can also explore the logs in <a href="https://www.elastic.co/guide/en/kibana/8.14/discover.html">Discover</a>.</p>
<h4>Explanation of Logging Modification</h4>
<ul>
<li>
<p><strong>logging.info</strong>: This logs an informational message. The message &quot;BigQuery load operation successful&quot; is logged.</p>
</li>
<li>
<p><strong>extra=bq_fields</strong>: This adds additional context to the log entry using the <code>bq_fields</code> dictionary. This context can include details making the log entries more informative and easier to analyze. This data will be later used to set up alerts and data anomaly detection jobs.</p>
</li>
</ul>
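<p>To see what <code>extra</code> does before any exporter is involved, the standard-library sketch below captures log records in memory. The handler class and logger name are illustrative assumptions; the example shows only that the extra keys become attributes of the log record and that messages below the configured level are dropped.</p>

```python
import logging


class ListHandler(logging.Handler):
    """Collect log records in memory so their attributes can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("etl_example")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example isolated from the root logger
handler = ListHandler()
logger.addHandler(handler)

logger.debug("below the INFO threshold, so this record is dropped")
logger.info(
    "BigQuery load operation successful",
    extra={"event": "BigQuery Load Job", "status": "success"},
)

record = handler.records[0]
print(record.getMessage(), record.event, record.status)
```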
<h2>Monitoring in Elastic's APM</h2>
<p>As shown, we can examine traces, metrics, and logs in the APM interface. To make the most of this data, we also use nearly the whole suite of features in Elastic Observability alongside Elastic Analytics' ML capabilities.</p>
<h3>Rules and Alerts</h3>
<p>We can set up rules and alerts to detect anomalies, errors, and performance issues in our scripts.</p>
<p>The <a href="https://www.elastic.co/guide/en/kibana/current/apm-alerts.html#apm-create-error-alert"><code>error count threshold</code> rule</a> is used to create a trigger when the number of errors in a service exceeds a defined threshold.</p>
<p>To create the rule go to Alerts and Insights -&gt; Rules -&gt; Create Rule -&gt; Error count threshold, set the error count threshold, the service or environment you want to monitor (you can also set an error grouping key across services), how often to run the check, and choose a connector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/error-count-threshold.png" alt="6 - Error Count Threshold Rule" /></p>
<p>Next, we create a rule of type <code>custom threshold</code> on a given ETL logs <a href="https://www.elastic.co/guide/en/kibana/current/data-views.html">data view</a> (create one for your index) filtering on &quot;labels.status: error&quot; to get all the logs with status error from any of the steps of the ETL which have failed. The rule condition is set to document count &gt; 0. In our case, in the last section of the rule config, we also set up Slack <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">alerts</a> every time the rule is activated. You can pick from a long list of <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">connectors</a> Elastic supports.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/etl-fail-status-rule.png" alt="7 - ETL Status Error Rule" /></p>
<p>For this rule to work, we add the status to the log metadata for each step of the ETLs, as shown in the code sample below. It then becomes available in ES via <code>labels.status</code>.</p>
<pre><code class="language-python">logging.info(
    &quot;Elasticsearch search operation successful&quot;,
    extra={
        &quot;event&quot;: &quot;Elasticsearch Search&quot;,
        &quot;status&quot;: &quot;success&quot;,
        &quot;category&quot;: category,
        &quot;index&quot;: index,
    },
)
</code></pre>
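<p>To keep the status field consistent across steps, the success and error cases can be wrapped in one helper. The sketch below is a hypothetical pattern, not the code we run; <code>run_step</code>, <code>failing_search</code>, and the category value are names invented for the example.</p>

```python
import logging


def run_step(event, func, category, **fields):
    """Run one ETL step and log its outcome with a status field."""
    # Extra keys must not clash with built-in LogRecord attributes such as "name".
    base = {"event": event, "category": category, **fields}
    try:
        result = func()
    except Exception:
        # logging.exception logs at ERROR level and attaches the traceback.
        logging.exception("%s failed", event, extra={**base, "status": "error"})
        raise
    logging.info("%s successful", event, extra={**base, "status": "success"})
    return result


def failing_search():
    """Stand-in for a search step that fails."""
    raise RuntimeError("cluster unavailable")


try:
    run_step("Elasticsearch Search", failing_search, category="example")
    outcome = "success"
except RuntimeError:
    outcome = "error"
```

<p>The error is logged with <code>status</code> set to <code>error</code> and then re-raised, so the script still fails visibly while the log entry can feed the rule described above.</p>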
<h3>More Rules</h3>
<p>We could also add rules to detect anomalies in the execution time of the different spans we define. This is done by selecting transaction/span -&gt; Alerts and rules -&gt; Custom threshold rule -&gt; Latency. In the example below, we want to generate an alert whenever the search step takes more than 25s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency.png" alt="8 - APM Custom Threshold - Latency" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency_2.png" alt="9 - APM Custom Threshold - Config" /></p>
<p>Alternatively, for finer-grained control, you can go with Alerts and rules -&gt; Anomaly rule, set up an anomaly job, and pick a threshold severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_anomaly_rule_config.png" alt="10 - APM Anomaly Rule - Config" /></p>
<h3>Anomaly detection job</h3>
<p>In this example, we set up an <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">Anomaly Detection job</a> on the number of documents before the transform, using a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">Single metric job</a>, to detect any anomalies in the incoming data source.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/single-metrics.png" alt="11 - Single Metrics" /></p>
<p>In the last step, you can create an alert, similarly to what we did before, to be notified whenever an anomaly is detected, by setting a severity level threshold. Every anomaly is assigned an anomaly score, which determines its severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-1.png" alt="12 - Anomaly detection Alerting - Severity" /></p>
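<p>The relationship between anomaly score and severity can be sketched as a simple banding of the 0 to 100 score. The band boundaries below are an assumption based on the values commonly shown in Kibana; confirm them against your version before relying on them.</p>

```python
# Assumed lower bounds of each severity band (anomaly scores range from 0 to 100).
SEVERITY_BANDS = [(75, "critical"), (50, "major"), (25, "minor"), (0, "warning")]


def severity(anomaly_score):
    """Map an anomaly score to a severity label using the assumed bands."""
    for lower_bound, label in SEVERITY_BANDS:
        if anomaly_score >= lower_bound:
            return label
    raise ValueError("anomaly score must be non-negative")


def should_notify(anomaly_score, threshold="major"):
    """Alert only when the score reaches the lower bound of the chosen severity."""
    lower_bounds = {label: bound for bound, label in SEVERITY_BANDS}
    return anomaly_score >= lower_bounds[threshold]
```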
<p>Similarly to the previous example, we set up a Slack connector to receive alerts whenever an anomaly is detected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-connectors.png" alt="13 - Anomaly detection Alerting - Connectors" /></p>
<p>You can add the job results to your custom dashboard by going to Add Panel -&gt; ML -&gt; Anomaly Swim Lane -&gt; Pick your job.</p>
<p>We also add jobs for the number of documents after the transform, and a Multi-Metric one on <code>execution_time_ms</code>, <code>bytes_processed</code> and <code>rows_affected</code>, similarly to how it was done in <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor your DBT pipelines with Elastic Observability</a>.</p>
<h2>Custom Dashboard</h2>
<p>Now that your logs, metrics, and traces are in Elastic, you can use the full potential of our Kibana dashboards to extract the most from them. We can create a custom dashboard like the following one: a pie chart based on <code>labels.event</code> (category field for every type of step in the ETLs), a chart for every type of step broken down by status, a timeline of steps broken down by status, BQ stats for the ETL, and anomaly detection swim lane panels for the various anomaly jobs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/custom_dashboard.png" alt="14 - Custom Dashboard" /></p>
<h2>Conclusion</h2>
<p>Elastic’s APM, in combination with other Observability and ML features, provides a unified view of our data pipelines, allowing us to bring a lot of value with minimal code changes:</p>
<ul>
<li>Capture logs automatically (no need to add custom logging) alongside their execution context</li>
<li>Monitor the runtime behavior of our models</li>
<li>Track data quality issues</li>
<li>Identify and troubleshoot real-time incidents</li>
<li>Optimize performance bottlenecks and resource usage</li>
<li>Identify dependencies on other services and their latency</li>
<li>Optimize data transformation processes</li>
<li>Set up alerts on latency, data quality issues, error rates of transactions, or CPU usage</li>
</ul>
<p>With these capabilities, we can ensure the resilience and reliability of our data pipelines, leading to a more robust and accurate BI system and reporting.</p>
<p>In conclusion, setting up OpenTelemetry (OTEL) in Python for data pipeline observability has significantly improved our ability to monitor, detect, and resolve issues proactively. This has led to more reliable data transformations, better resource management, and enhanced overall performance of our data transformation, BI and Machine Learning systems.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/main_image.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[NGINX log analytics with GenAI in Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/nginx-log-analytics-with-genai-elastic</link>
            <guid isPermaLink="false">nginx-log-analytics-with-genai-elastic</guid>
            <pubDate>Fri, 05 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic has a set of embedded capabilities such as a GenAI RAG-based AI Assistant and a machine learning platform as part of the product baseline. These make analyzing the vast number of logs you get from NGINX easier.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. NGINX, which is widely used for web serving, load balancing, HTTP caching, and reverse proxying, is key to many applications and outputs a large volume of logs. NGINX’s access logs, which detail all requests made to the NGINX server, and its error logs, which record server-related issues and problems, are key to managing and analyzing NGINX issues and to understanding what is happening in your application.</p>
<p>For managing NGINX, Elastic provides several capabilities:</p>
<ol>
<li>
<p>Easy ingest, parsing, and out-of-the-box dashboards. Check out the simple how-to in our <a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">docs</a>. Based on logs, these dashboards show several items over time: response codes, errors, top pages, data volume, browsers used, active connections, drop rates, and much more.</p>
</li>
<li>
<p>Out-of-the-box ML-based anomaly detection jobs for your NGINX logs. These jobs help pinpoint anomalies in request rates, source IP request rates, URL access, status codes, and visitor rates.</p>
</li>
<li>
<p>ES|QL, which helps you work through logs and build out charts during analysis.</p>
</li>
<li>
<p>Elastic’s GenAI Assistant provides a simple natural language interface that helps analyze all the logs and can pull out issues from ML jobs and even create dashboards. The Elastic AI Assistant also automatically uses ES|QL.</p>
</li>
<li>
<p>NGINX SLOs - Finally, Elastic provides the ability to define and monitor SLOs for your NGINX logs. While most SLOs are metrics-based, Elastic allows you to create logs-based SLOs. We detailed this in a previous <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>.</p>
</li>
</ol>
<p>NGINX logs are another example of why logs are great. Logging is an important part of observability, even though we generally think of metrics and tracing first. However, the volume of logs that an application and its underlying infrastructure output can be daunting, and NGINX is usually the starting point for most analyses.</p>
<p>In today’s blog, we’ll cover how the out-of-the-box ML-based anomaly detection jobs can help with root cause analysis (RCA), and how Elastic’s GenAI Assistant helps you work through logs to pinpoint issues in minutes.</p>
<h2>Prerequisites and config<a id="prerequisites-and-config"></a></h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>
<p>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</p>
</li>
<li>
<p>Bring up an <a href="https://docs.nginx.com/nginx/admin-guide/web-server/">NGINX server</a> on a host, or run an application with NGINX as a front end, and drive traffic to it.</p>
</li>
<li>
<p>Install the NGINX integration and assets and review the dashboards as noted in the <a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">docs</a>.</p>
</li>
<li>
<p>Ensure you have an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html">ML node configured</a> in your Elastic stack.</p>
</li>
<li>
<p>To use the AI Assistant you will need a trial or upgrade to Platinum.</p>
</li>
</ul>
<p>In our scenario, we use three months of data from our Elastic environment to help highlight the features. Hence, you might need to run your application with traffic for a specific time frame to follow along.</p>
<h2>Analyzing the issues with AI Assistant<a id="analyzing-the-issues-with-ai-assistant"></a></h2>
<p>As detailed in a previous <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>, you can get alerted on issues via SLO monitoring against NGINX logs. Let’s assume you have an SLO based on status codes, as outlined in that <a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">blog</a>. You can immediately analyze the issue via the AI Assistant. Because it's a chat interface, we simply open the AI Assistant and work through some simple analysis (see the video below for a demo):</p>
<h3>AI Assistant analysis:<a id="ai-assistant-analysis"></a></h3>
<ul>
<li>
<p><strong><em>Using lens graph all http response status codes &lt; 400 and &gt; =400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer</em></strong> <em>-</em> We simply wanted to understand the number of requests resulting in a status code &gt;= 400 and graph the results. We see that 15% of the requests were not successful, hence the SLO alert being triggered.</p>
</li>
<li>
<p><strong><em>Which ip address (field source.adress) has the highest number of http.response.status.code &gt;= 400 from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer</em></strong> - We were curious whether a specific IP address was responsible for the unsuccessful requests. 72.57.0.53, with a count of 25,227 occurrences, is high but does not account for all of the failed requests.</p>
</li>
<li>
<p><strong><em>What country (source.geo.country_iso_code) is source.address=72.57.0.53 coming from. Use filebeat-nginx-elasticco-anon-2017.</em></strong> - Again, we were curious whether this came from a specific country. The IP address 72.57.0.53 comes from the country with the ISO code IN, which corresponds to India. Nothing out of the ordinary.</p>
</li>
<li>
<p><strong><em>Did source.address=72.57.0.53 have any (http.response.status.code &lt; 400) from filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer -</em></strong> Oddly, the IP address in question had only 4,000+ successful responses, which suggests it is not malicious and points to something else.</p>
</li>
<li>
<p><strong><em>What are the different status codes (http.response.status.code&gt;=400), from source.address=72.57.0.53. Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code -</em></strong> We were curious whether we would see any 502s. There were none; most of the failures were 404s.</p>
</li>
<li>
<p><strong><em>What are the different status codes (http.response.status.code&gt;=400). Use filebeat-nginx-elasticco-anon-2017. http.response.status.code is not an integer. Provide counts for each status code</em></strong> - Regardless of the specific address, which status code &gt;= 400 occurs most often? This also points to 404.</p>
</li>
<li>
<p><strong><em>What does a high 404 count from a specific IP address mean from NGINX logs?</em></strong> - With this question, we wanted to understand the potential causes in our application. From the answers, we can rule out security probing and web scraping, since we validated that the specific address 72.57.0.53 has a low number of non-successful requests. It also rules out user error. Hence, this potentially points to broken links or missing resources.</p>
</li>
</ul>
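<p>For readers who want to see what these questions compute, the following is a rough sketch of the same aggregations over a handful of in-memory records. The sample records and helper names are hypothetical, and this is not how the AI Assistant runs the analysis (it generates ES|QL against the index); the field names simply follow the prompts above. Note the cast from string to integer, which mirrors the "is not an integer" hint in the prompts.</p>

```python
from collections import Counter

# Hypothetical sample records; field names follow the prompts above.
LOGS = [
    {"source.address": "72.57.0.53", "http.response.status.code": "404"},
    {"source.address": "72.57.0.53", "http.response.status.code": "404"},
    {"source.address": "72.57.0.53", "http.response.status.code": "200"},
    {"source.address": "10.0.0.7", "http.response.status.code": "500"},
    {"source.address": "10.0.0.8", "http.response.status.code": "200"},
    {"source.address": "10.0.0.9", "http.response.status.code": "304"},
]


def is_failure(record: dict) -> bool:
    # The status code is stored as a string, so cast before comparing.
    return int(record["http.response.status.code"]) >= 400


def summarize(logs: list) -> dict:
    """Summarize failures; assumes logs contains at least one failed request."""
    failures = [r for r in logs if is_failure(r)]
    by_address = Counter(r["source.address"] for r in failures)
    by_status = Counter(r["http.response.status.code"] for r in failures)
    return {
        "failure_ratio": len(failures) / len(logs),
        "top_address": by_address.most_common(1)[0],
        "top_status": by_status.most_common(1)[0],
    }


summary = summarize(LOGS)
print(summary)
```

<p>Each key in the summary corresponds to one of the questions asked above: the share of unsuccessful requests, the address with the most failures, and the most frequent failing status code.</p>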
<h3>Watch the flow:<a id="watch-the-flow"></a></h3>
&lt;Video vidyardUuid=&quot;ak9xDdhcL3SxpqU7CRsD68&quot; /&gt;
<h3>Potential issue:</h3>
<p>It seems that we potentially have an issue with the backend serving specific responses or with resources (database or broken links). This is causing the higher-than-normal number of non-successful status codes (&gt;= 400).</p>
<h3>Key highlights from AI Assistant:</h3>
<p>As you watch the video, you will notice a few things:</p>
<ol>
<li>
<p>We analyzed millions of logs in a matter of minutes using a set of simple natural language queries. </p>
</li>
<li>
<p>We didn’t need to know any special query language. The AI Assistant used Elastic’s ES|QL but can also use KQL.</p>
</li>
<li>
<p>The AI Assistant easily builds out graphs.</p>
</li>
<li>
<p>The AI Assistant accesses and uses internal information stored in Elastic’s indices, as opposed to a simple “google foo”-based AI assistant. This is enabled through RAG, and the AI Assistant can also bring up known issues in GitHub, runbooks, and other useful internal information.</p>
</li>
</ol>
<p>Check out the following <a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">blog</a> on how the AI Assistant uses RAG to retrieve internal information, specifically GitHub issues and runbooks.</p>
<h2>Locating anomalies with ML</h2>
<p>While the AI Assistant is great for analyzing information, another important aspect of NGINX log management is ensuring you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs that analyze one or more metrics to look for anomalies. When using NGINX, there are several <a href="https://www.elastic.co/guide/en/machine-learning/current/ootb-ml-jobs-nginx.html">out-of-the-box anomaly detection jobs</a>. These work specifically on NGINX access logs.</p>
<ul>
<li>
<p>Low_request_rate_nginx - Detect low request rates</p>
</li>
<li>
<p>Source_ip_request_rate_nginx - Detect unusual source IPs - high request rates</p>
</li>
<li>
<p>Source_ip_url_count_nginx - Detect unusual source IPs - high distinct count of URLs</p>
</li>
<li>
<p>Status_code_rate_nginx - Detect unusual status code rates</p>
</li>
<li>
<p>Visitor_rate_nginx - Detect unusual visitor rates</p>
</li>
</ul>
<p>Since these jobs are available right out of the box, let’s look at the Status_code_rate_nginx job, which is related to our previous analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/nginx-log-analytics-with-genai-elastic/nginx-ml-log-analytics.png" alt="NGINX ML Log Analytics" /></p>
<p>With a few simple clicks, we immediately get an analysis showing a specific IP address, 72.57.0.53, with a higher-than-normal number of non-successful requests. Notably, this is the same address we found using the AI Assistant.</p>
<p>We can take this further with conversations with the AI Assistant, look at the logs, and/or even look at the other ML anomaly jobs.</p>
<h2>Conclusion<a id="conclusion"></a></h2>
<p>You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze NGINX logs without needing to know the query syntax, where the data is, or even the field names. You’ve also seen how Elastic can alert you to a potential issue or degradation in service (via an SLO).</p>
<p>Check out other resources on NGINX logs:</p>
<p><a href="https://www.elastic.co/guide/en/machine-learning/current/ootb-ml-jobs-nginx.html">Out-of-the-box anomaly detection jobs for NGINX</a></p>
<p><a href="https://www.elastic.co/guide/en/fleet/current/example-standalone-monitor-nginx.html">Using the NGINX integration to ingest and analyze NGINX Logs</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics">NGINX Logs based SLOs in Elastic</a></p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-rag-ai-assistant-application-issues-llm-github">Using GitHub issues, runbooks, and other internal information for RCAs with Elastic’s RAG based AI Assistant</a></p>
<h2>Try it out<a id="try-it-out"></a></h2>
<p>Existing Elastic Cloud customers can access many of these features directly from the <a href="https://cloud.elastic.co/">Elastic Cloud console</a>. Not taking advantage of Elastic on the cloud? <a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a>.</p>
<p>All of this is also possible in your environment. <a href="https://www.elastic.co/observability/universal-profiling">Learn how to get started today</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/nginx-log-analytics-with-genai-elastic/blog-thumb-observability-pattern-color.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Root cause analysis with logs: Elastic Observability's AIOps Labs]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-logs-machine-learning-aiops</link>
            <guid isPermaLink="false">observability-logs-machine-learning-aiops</guid>
            <pubDate>Thu, 27 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides more than just log aggregation, metrics analysis, APM, and distributed tracing. Our machine learning-based AIOps capabilities help you analyze the root cause of issues allowing you to focus on the most important tasks.]]></description>
            <content:encoded><![CDATA[<p>In the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">previous blog</a> in our root cause analysis with logs series, we explored how to analyze logs in Elastic Observability with Elastic’s anomaly detection and log categorization capabilities. Elastic’s platform enables you to get started on machine learning (ML) quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.</p>
<p>Preconfigured <a href="https://www.elastic.co/blog/may-2023-launch-machine-learning-models">machine learning models</a> for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To get you started, there are several key features built into Elastic Observability to aid in analysis, bypassing the need to run specific ML models. These features help minimize the time and analysis of logs.</p>
<p>Let’s review the set of machine learning-based observability features in Elastic:</p>
<p><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</p>
<p><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action more quickly.</p>
<p><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. Read <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a> for an overview of this capability.</p>
<p><strong>AIOps Labs:</strong> AIOps Labs provides two main capabilities using advanced statistical methods:</p>
<ul>
<li><strong>Log spike detector</strong> helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ul>
<p>As we showed in the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">last blog</a>, using machine learning-based features helps minimize the extremely tedious and time-consuming process of analyzing data using traditional methods, such as alerting and simple pattern matching (visual or simple searching, etc.). Trying to find the needle in the haystack requires the use of some level of artificial intelligence due to the increasing amounts of telemetry data (logs, metrics, and traces) being collected across ever-growing applications.</p>
<p>In this blog post, we’ll cover two capabilities found in Elastic’s AIOps Labs: log spike detector and log pattern analysis. We’ll use the same data from the previous blog and analyze it using these two capabilities.</p>
<p><em><strong>We will cover log spike detector and log pattern analysis against the popular Hipster Shop app developed by Google, and modified recently by OpenTelemetry.</strong></em></p>
<p>Overviews of high-latency capabilities can be found <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">here</a>, and an overview of AIOps labs can be found <a href="https://www.youtube.com/watch?v=jgHxzUNzfhM&amp;list=PLhLSfisesZItlRZKgd-DtYukNfpThDAv_&amp;index=5">here</a>.</p>
<p>Below, we will examine a scenario where we use the log spike detector and log pattern analysis to help identify a root cause of an issue in Hipster Shop.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Use a version of the popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Hipster Shop</a> demo application. It was originally written by Google to showcase Kubernetes and is now available in a multitude of variants, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. The Elastic version is found <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OTel in Elastic</a> and <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Observability and Security with OTel in Elastic</a>. Additionally, review the <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">OTel documentation in Elastic</a>.</li>
<li>Look through an overview of <a href="https://www.elastic.co/guide/en/observability/current/apm.html">Elastic Observability APM capabilities</a>.</li>
<li>Look through our <a href="https://www.elastic.co/guide/en/observability/8.5/inspect-log-anomalies.html">anomaly detection documentation</a> for logs and <a href="https://www.elastic.co/guide/en/observability/8.5/categorize-logs.html">log categorization documentation</a>.</li>
</ul>
<p>Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-service-map.png" alt="observability service map" /></p>
<p>In our example, we’ve introduced issues to help walk you through the root cause analysis features. You might have a different set of issues depending on how you load the application and/or introduce specific feature flags.</p>
<p>As part of the walk-through, we’ll assume we are a DevOps engineer or SRE managing this application in production.</p>
<h2>Root cause analysis</h2>
<p>While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can come from the notification settings you’ve set up in Elastic or from other external notification platforms (including customer-related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.</p>
<p>How do you, as a DevOps engineer or SRE, investigate this? We will walk through two avenues in Elastic to investigate the issue:</p>
<ul>
<li>Log spike analysis</li>
<li>Log pattern analysis</li>
</ul>
<p>While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.</p>
<p>Starting with the service map, you can see anomalies identified with red circles, and as we select them, Elastic provides a score for the anomaly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-service-map-service-details.png" alt="observability service map service details" /></p>
<p>In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. Rather than jump into anomaly detection (see previous <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">blog</a>), let’s look at some of the potential issues by reviewing the service details in APM.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-overview.png" alt="observability product catalog service overview" /></p>
<p>What we see for the productCatalogService is that there are latency issues, failed transactions, a large number of errors, and a dependency on PostgreSQL. When we look at the errors in more detail and drill down, we see they are all coming from <a href="https://pkg.go.dev/github.com/lib/pq">PQ - which is a PostgreSQL driver in Go</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-errors.png" alt="observability product catalog service errors" /></p>
<p>As we drill further, we still can’t tell why the productCatalogService is not able to pull information from the PostgreSQL database.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-product-catalog-service-error-group.png" alt="observability product catalog service error group" /></p>
<p>We see that there is a spike in errors, so let's see if we can glean further insight using one of our two options:</p>
<ul>
<li>Log rate spikes</li>
<li>Log pattern analysis</li>
</ul>
<h3>Log rate spikes</h3>
<p>Let’s start with the <strong>log rate spikes</strong> detector in the AIOps Labs section of Elastic’s machine learning capabilities. We also pre-select analyzing the spike against a baseline history.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-rate-spikes-postgres.png" alt="explain log rate spikes postgres" /></p>
<p>The log rate spikes detector has looked at all the logs from the spike and compared them to the baseline, and it's seeing higher-than-normal counts in specific log messages. From a visual inspection, we see that PostgreSQL log messages are high. We further filter this with postgres.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-rate-spikes-pgbench.png" alt="explain log rates spikes pgbench" /></p>
<p>We immediately notice that this issue is potentially caused by pgbench, a popular PostgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple, concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment as it causes a heavy load on the database host, likely causing higher latency issues on the site.</p>
<p>While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.</p>
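<p>The idea of comparing a spike window against a baseline can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses a plain ratio of shares with add-one smoothing, not the statistical method Elastic’s detector applies, and the function name, the <code>min_ratio</code> parameter, and the sample values are all hypothetical.</p>

```python
from collections import Counter


def spiking_values(baseline, spike, min_ratio=3.0):
    """Return (value, baseline_share, spike_share) for values over-represented in the spike."""
    base_counts, spike_counts = Counter(baseline), Counter(spike)
    results = []
    for value, count in spike_counts.items():
        spike_share = count / len(spike)
        # Add-one smoothing so values unseen in the baseline do not divide by zero.
        base_share = (base_counts[value] + 1) / (len(baseline) + 1)
        if spike_share / base_share >= min_ratio:
            results.append((value, base_share, spike_share))
    # Most over-represented values first.
    return sorted(results, key=lambda item: item[2] / item[1], reverse=True)


# Hypothetical field values (for example, a process name) seen in each window.
baseline = ["frontend"] * 8 + ["postgres"] * 2
spike = ["frontend"] * 8 + ["postgres"] * 2 + ["pgbench"] * 30
top = spiking_values(baseline, spike)
print(top)
```

<p>In this toy data, only the value that is new and dominant in the spike window is reported, which is the same kind of signal that surfaced pgbench in the walkthrough above.</p>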
<h3>Log pattern analysis</h3>
<p>Instead of log rate spikes, let’s use log pattern analysis to investigate the spike in errors we saw in productCatalogService. In AIOps Labs, we simply select Log Pattern Analysis, use Logs data, filter the results with postgres (since we know it's related to PostgreSQL), and look at information from the message field of the logs we are processing. We see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-explain-log-pattern-analysis.png" alt="observability explain log pattern analysis" /></p>
<p>Almost immediately, we see that the biggest pattern it finds is a log message where pgbench is updating the database. From log pattern analysis, we can drill directly into this log message in Discover to review the details and analyze the messages further.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/blog-elastic-observability-expanded-document.png" alt="expanded document" /></p>
<p>As we mentioned in the previous section, while it may or may not be the root cause, it quickly gives us a place to start and a potential root cause. A developer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.</p>
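<p>To give a feel for what categorizing unstructured messages means, here is a deliberately naive sketch: it replaces every token containing a digit with a wildcard and counts the resulting templates. The sample messages and the <code>to_pattern</code> helper are hypothetical, and Elastic’s categorization is considerably more sophisticated than this.</p>

```python
import re
from collections import Counter

# Hypothetical log messages; real ones would come from the message field.
MESSAGES = [
    "UPDATE pgbench_accounts SET abalance = abalance + 4821 WHERE aid = 93312",
    "UPDATE pgbench_accounts SET abalance = abalance + -120 WHERE aid = 5571",
    "connection received: host=10.0.0.12 port=51234",
    "UPDATE pgbench_accounts SET abalance = abalance + 77 WHERE aid = 41",
]


def to_pattern(message: str) -> str:
    # Replace every whitespace-separated token that contains a digit with a wildcard.
    return " ".join("*" if re.search(r"\d", tok) else tok for tok in message.split())


patterns = Counter(to_pattern(m) for m in MESSAGES)
for pattern, count in patterns.most_common():
    print(count, pattern)
```

<p>Grouping by template rather than by exact message is what lets a single dominant pattern, such as the pgbench update statement, stand out from thousands of otherwise unique log lines.</p>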
<h2>Conclusion</h2>
<p>Between the <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">first blog</a> and this one, we’ve shown how Elastic Observability can help you further identify and get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned in this blog.</p>
<ul>
<li>
<p>Elastic Observability has numerous capabilities to help you reduce your time to find the root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities (found in AIOps Labs in Elastic) in this blog:</p>
<ol>
<li><strong>Log rate spikes</strong> detector helps identify reasons for increases in log rates. It makes it easy to find and investigate the causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ol>
</li>
<li>
<p>You learned how easy it is to use Elastic Observability’s log rate spikes and log pattern analysis capabilities without having to understand machine learning (which helps drive these features) or do any lengthy setup.</p>
</li>
</ul>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register for Elastic Cloud</a> and try out the features and capabilities outlined above.</p>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
<p><em>Elastic and Elasticsearch are trademarks, logos or registered trademarks of Elasticsearch B.V. in the United States and other countries.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-logs-machine-learning-aiops/illustration-machine-learning-anomaly-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitoring service performance: An overview of SLA calculation for Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observability-sla-calculations-transforms</link>
            <guid isPermaLink="false">observability-sla-calculations-transforms</guid>
            <pubDate>Mon, 24 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Stack provides many valuable insights for different users, such as reports on service performance and if the service level agreement (SLA) is met. In this post, we’ll provide an overview of calculating an SLA for Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>Elastic Stack provides many valuable insights for different users. Developers are interested in low-level metrics and debugging information. <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">SREs</a> are interested in seeing everything at once and identifying where the root cause is. Managers want reports that tell them how good service performance is and if the service level agreement (SLA) is met. In this post, we’ll focus on the service perspective and provide an overview of calculating an SLA.</p>
<p><em>Since version 8.8, Elastic has built-in functionality to calculate SLOs:</em> <a href="https://www.elastic.co/guide/en/observability/current/slo.html"><em>check out our guide</em></a><em>!</em></p>
<h2>Foundations of calculating an SLA</h2>
<p>There are many ways to calculate and measure an SLA. The most important part is the definition of the SLA, and as a consultant, I’ve seen many different ways. Some examples include:</p>
<ul>
<li>HTTP 2xx responses must account for more than 98% of all HTTP responses</li>
<li>Response time of successful HTTP 2xx requests must be below x milliseconds</li>
<li>Synthetic monitor must be up at least 99%</li>
<li>95% of all batch transactions from the billing service need to complete within 4 seconds</li>
</ul>
<p>Depending on the origin of the data, calculating the SLA can be easier or more difficult. For uptime (Synthetic Monitoring), we automatically provide SLA values and offer out-of-the-box alerts, so you can simply define an alert that fires when availability drops below 98% over the last hour.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-overview-monitor-details.png" alt="overview monitor details" /></p>
<p>I personally recommend using <a href="https://www.elastic.co/blog/new-synthetic-monitoring-observability">Elastic Synthetic Monitoring</a> whenever possible to monitor service performance. Running HTTP requests and verifying the answers from the service, or doing fully fledged browser monitors and clicking through the website as a real user does, ensures a better understanding of the health of your service.</p>
<p>Sometimes this is impossible, for example when you want to calculate the uptime of a specific Windows service that does not offer any TCP port or HTTP interaction. Keep in mind the caveat that a service that is running is not necessarily a service that is working fine.</p>
<h2>Transforms to the rescue</h2>
<p>We have identified our important service. In our case, it is the Steam Client Helper. There are two ways to solve this.</p>
<h3>Lens formula</h3>
<p>You can use Lens and its formula option (for a deep dive into formulas, <a href="https://www.elastic.co/blog/how-tough-was-your-workout-take-a-closer-look-at-strava-data-through-kibana-lens">check out this blog</a>). Use the search bar to filter down to the data you want. Then use the formula option in Lens. We divide the count of records with Running as a state by the overall count of records. This is a nice solution when you need to calculate quickly and on the fly.</p>
<pre><code class="language-sql">count(kql='windows.service.state: &quot;Running&quot; ')/count()
</code></pre>
<p>Using the formula posted above as the bar chart's vertical axis calculates the uptime percentage. We use an annotation to mark why there is a dip and why this service was below the threshold. The annotation is set to reboot, which indicates a reboot happening, and thus, the service was down for a moment. Lastly, we add a reference line and set this to our defined threshold at 98%. This ensures that a quick look at the visualization allows our eyes to gauge if we are above or below the threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-visualization.png" alt="visualization" /></p>
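<p>To make the arithmetic behind the formula explicit, here is a minimal Python sketch of the same ratio. The sample records and the function name availability are assumptions made for illustration. Lens performs this calculation for you, so this is not how Kibana implements it.</p>
<pre><code class="language-python"># Hypothetical sample of service-state records, one per collection interval.
records = [
    {"windows.service.state": "Running"},
    {"windows.service.state": "Running"},
    {"windows.service.state": "Stopped"},
    {"windows.service.state": "Running"},
]


def availability(docs, field="windows.service.state", up_value="Running"):
    """Return the share of documents whose state equals up_value."""
    total = len(docs)
    if total == 0:
        return None  # no data collected, so availability is undefined
    up = sum(1 for doc in docs if doc.get(field) == up_value)
    return up / total


print(availability(records))  # 3 of the 4 sample records are Running
</code></pre>
<p>With the 98% target used in this post, a value such as the one above would fall below the reference line.</p>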
<h3>Transform</h3>
<p>What if you are interested not in just one service, but in multiple services needed for your SLA? That is where transforms come in. A second issue with the Lens approach is that the calculated value is only available inside the Lens visualization, so we cannot create any alerts on it.</p>
<p>Go to Transforms and create a pivot transform.</p>
<ol>
<li>
<p>Add the following filter to narrow it to only the services data set: data_stream.dataset: &quot;windows.service&quot;. If you are interested in a specific service, for example to know whether a specific remote management service is up across your entire fleet, you can add it to the search bar as well.</p>
</li>
<li>
<p>Select date_histogram(@timestamp) and set it to your chosen unit. By default, the Elastic Agent only collects service states every 60 seconds. I am going with 1 hour.</p>
</li>
<li>
<p>Select agent.name and windows.service.name as well.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration.png" alt="transform configuration" /></p>
<ol start="4">
<li>Now we need to define an aggregation type. We will use a value_count of windows.service.state. That just counts how many records have a value for this field.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-aggregations.png" alt="aggregations" /></p>
<ol start="5">
<li>
<p>Rename the value_count to total_count.</p>
</li>
<li>
<p>Add value_count for windows.service.state a second time and use the pencil icon to edit it to terms, which aggregates for running.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-aggregations-apply.png" alt="aggregations apply" /></p>
<ol start="7">
<li>
<p>This opens up a sub-aggregation. Once again, select value_count(windows.service.state) and rename it to values.</p>
</li>
<li>
<p>Now, the preview shows us the count of records with any state and the count of records with the Running state.</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration-next.png" alt="transform configuration" /></p>
<ol start="9">
<li>
<p>Here comes the tricky part. We need to write some custom aggregations to calculate the percentage of uptime. Click on the copy icon next to the edit JSON config.</p>
</li>
<li>
<p>In a new tab, go to Dev Tools. Paste what you have in the clipboard.</p>
</li>
<li>
<p>Press the play button or use the keyboard shortcut ctrl+enter/cmd+enter and run it. This will create a preview of what the data looks like. It should give you the same information as in the table preview.</p>
</li>
<li>
<p>Now, we need to calculate the uptime percentage, which means adding a bucket script where we divide running.values by total_count, just like we did in the Lens visualization. If you named the columns differently or use more than a single value, you will need to adapt the paths accordingly.</p>
</li>
</ol>
<pre><code class="language-json">&quot;availability&quot;: {
        &quot;bucket_script&quot;: {
          &quot;buckets_path&quot;: {
            &quot;up&quot;: &quot;running&gt;values&quot;,
            &quot;total&quot;: &quot;total_count&quot;
          },
          &quot;script&quot;: &quot;params.up/params.total&quot;
        }
      }
</code></pre>
<ol start="13">
<li>This is the entire transform for me:</li>
</ol>
<pre><code class="language-bash">POST _transform/_preview
{
  &quot;source&quot;: {
    &quot;index&quot;: [
      &quot;metrics-*&quot;
    ]
  },
  &quot;pivot&quot;: {
    &quot;group_by&quot;: {
      &quot;@timestamp&quot;: {
        &quot;date_histogram&quot;: {
          &quot;field&quot;: &quot;@timestamp&quot;,
          &quot;calendar_interval&quot;: &quot;1h&quot;
        }
      },
      &quot;agent.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;agent.name&quot;
        }
      },
      &quot;windows.service.name&quot;: {
        &quot;terms&quot;: {
          &quot;field&quot;: &quot;windows.service.name&quot;
        }
      }
    },
    &quot;aggregations&quot;: {
      &quot;total_count&quot;: {
        &quot;value_count&quot;: {
          &quot;field&quot;: &quot;windows.service.state&quot;
        }
      },
      &quot;running&quot;: {
        &quot;filter&quot;: {
          &quot;term&quot;: {
            &quot;windows.service.state&quot;: &quot;Running&quot;
          }
        },
        &quot;aggs&quot;: {
          &quot;values&quot;: {
            &quot;value_count&quot;: {
              &quot;field&quot;: &quot;windows.service.state&quot;
            }
          }
        }
      },
      &quot;availability&quot;: {
        &quot;bucket_script&quot;: {
          &quot;buckets_path&quot;: {
            &quot;up&quot;: &quot;running&gt;values&quot;,
            &quot;total&quot;: &quot;total_count&quot;
          },
          &quot;script&quot;: &quot;params.up/params.total&quot;
        }
      }
    }
  }
}
</code></pre>
<ol start="14">
<li>The preview in Dev Tools should work and be complete. Otherwise, you must debug any errors. Most of the time, it is the bucket script and the path to the values. You might have called it up instead of running. This is what the preview looks like for me.</li>
</ol>
<pre><code class="language-json">{
  &quot;running&quot;: {
    &quot;values&quot;: 1
  },
  &quot;agent&quot;: {
    &quot;name&quot;: &quot;AnnalenasMac&quot;
  },
  &quot;@timestamp&quot;: &quot;2021-12-07T19:00:00.000Z&quot;,
  &quot;total_count&quot;: 1,
  &quot;availability&quot;: 1,
  &quot;windows&quot;: {
    &quot;service&quot;: {
      &quot;name&quot;: &quot;InstallService&quot;
    }
  }
},
</code></pre>
<ol start="15">
<li>Now paste only the bucket script into the transform creation UI after selecting Edit JSON. It looks like this:</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-configuration-pivot-configuration-object.png" alt="transform configuration pivot configuration object" /></p>
<ol start="16">
<li>Give your transform a name, set the destination index, and run it continuously. When enabling continuous mode, make sure not to use @timestamp as the date field. Instead, opt for event.ingested. <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-checkpoints.html">Our documentation explains this in detail</a>.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-transform-details.png" alt="transform details" /></p>
<ol start="17">
<li>Click Next, then Create and start. This can take a bit, so don’t worry.</li>
</ol>
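<p>To make the pivot easier to reason about, the following Python sketch mimics what the transform computes: it groups documents by hour, agent.name, and windows.service.name, counts all records and the Running records, and divides the two as the bucket script does. The sample documents, the host name host-a, and the function name pivot are assumptions for illustration. The real work is done by Elasticsearch, not by code like this.</p>
<pre><code class="language-python">from collections import defaultdict
from datetime import datetime

# Hypothetical flattened documents from the windows.service data set.
docs = [
    {"@timestamp": "2021-12-07T19:00:10Z", "agent.name": "host-a",
     "windows.service.name": "InstallService", "windows.service.state": "Running"},
    {"@timestamp": "2021-12-07T19:01:10Z", "agent.name": "host-a",
     "windows.service.name": "InstallService", "windows.service.state": "Stopped"},
    {"@timestamp": "2021-12-07T20:00:10Z", "agent.name": "host-a",
     "windows.service.name": "InstallService", "windows.service.state": "Running"},
]


def pivot(documents):
    """Group by (hour, agent, service) and compute availability per bucket."""
    buckets = defaultdict(lambda: {"total_count": 0, "running": 0})
    for doc in documents:
        ts = datetime.strptime(doc["@timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        hour = ts.strftime("%Y-%m-%dT%H:00:00Z")  # truncate to the 1h bucket
        key = (hour, doc["agent.name"], doc["windows.service.name"])
        buckets[key]["total_count"] += 1
        if doc["windows.service.state"] == "Running":
            buckets[key]["running"] += 1

    rows = []
    for (hour, agent, service), counts in sorted(buckets.items()):
        rows.append({
            "@timestamp": hour,
            "agent.name": agent,
            "windows.service.name": service,
            "total_count": counts["total_count"],
            "running.values": counts["running"],
            # Same division as the bucket script: params.up / params.total
            "availability": counts["running"] / counts["total_count"],
        })
    return rows


for row in pivot(docs):
    print(row)
</code></pre>
<p>Each printed row corresponds to one document in the transform's destination index, with one availability value per hour, agent, and service.</p>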
<p>To summarize, we have now created a pivot transform using a bucket script aggregation to calculate the running time of a service as a percentage. There is a caveat: by default, Elastic Agent only collects the service state every 60 seconds. A service can be up exactly when it is collected and down a few seconds later. If the service is that important and no other monitoring option, such as <a href="https://www.elastic.co/blog/what-can-elastic-synthetics-tell-us-about-kibana-dashboards">Elastic Synthetics</a>, is possible, you might want to reduce the collection interval on the Agent side to get the service state every 30 or 45 seconds. Depending on how important your thresholds are, you can create multiple policies with different collection intervals. For example, a super important server might collect the service state every 10 seconds because you need as much granularity and assurance of the metric’s correctness as possible. For normal workstations where you just want to know if your remote access solution is up the majority of the time, you might not mind having a single metric every 60 seconds.</p>
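<p>The effect of the collection interval can be illustrated with a small Python simulation. The outage window, the one-hour observation window, and the function name observed_availability are all assumptions chosen for this sketch. It only models point-in-time sampling and does not reproduce how Elastic Agent collects data.</p>
<pre><code class="language-python">def observed_availability(down_intervals, sample_every, window=3600):
    """Sample the service state every sample_every seconds over window seconds.

    down_intervals is a list of (start, end) second offsets during which the
    service is down; start is inclusive and end is exclusive.
    """
    samples = range(0, window, sample_every)

    def is_up(second):
        return not any(second in range(start, end) for start, end in down_intervals)

    up = sum(1 for second in samples if is_up(second))
    return up / len(samples)


# Hypothetical outage: the service is down from second 65 until second 115.
outage = [(65, 115)]

# Samples at 0, 60, 120, ... never fall inside the outage.
print(observed_availability(outage, sample_every=60))

# Samples at 70, 80, 90, 100 and 110 fall inside the outage.
print(observed_availability(outage, sample_every=10))
</code></pre>
<p>In this example, the 60-second interval reports the service as fully available even though it was down for 50 seconds, while the 10-second interval registers the outage.</p>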
<p>After you have created the transform, one additional benefit is that the data is stored in an index in Elasticsearch, like any other data. When you only build the visualization, the metric is calculated for that visualization and is not available anywhere else. Since this is now data, you can create a threshold alert and send it to your favorite connector (Slack, Teams, ServiceNow, email, and so <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">many more to choose from</a>).</p>
<h2>Visualizing the transformed data</h2>
<p>The transform created a data view called windows-service. The first thing we want to do is change the format of the availability field to a percentage. This automatically tells Lens to format the field as a percentage, so you don’t need to select the format manually or do any calculations. Furthermore, in Discover, instead of seeing 0.5 you see 50%. Isn’t that cool? This is also possible for durations, like event.duration if you have it as nanoseconds! No more calculating on the fly and wondering whether you need to divide by 1,000 or 1,000,000.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-edit-field-availability.png" alt="edit field availability" /></p>
<p>We get this view by using a simple Lens visualization with the timestamp on the horizontal axis, using a minimum interval of 1 day, and an average of availability on the vertical axis. Don’t worry, the other data will be populated once the transform finishes. We can add a reference line using the value 0.98 because our target is 98% uptime of the service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/blog-elastic-line.png" alt="line" /></p>
<h2>Summary</h2>
<p>This blog post covered the steps needed to calculate the SLA for a specific data set in Elastic Observability, as well as how to visualize it. Using this calculation method opens the door to a lot of interesting use cases. You can change the bucket script and start calculating the number of sales, and the average basket size. Interested in learning more about Elastic Synthetics? Read <a href="https://www.elastic.co/guide/en/observability/current/monitor-uptime-synthetics.html">our documentation</a> or check out our free <a href="https://www.elastic.co/training/synthetics-quick-start">Synthetic Monitoring Quick Start training</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observability-sla-calculations-transforms/illustration-analytics-report-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openshift-container-logs-red-hat-logging-operator</link>
            <guid isPermaLink="false">openshift-container-logs-red-hat-logging-operator</guid>
            <pubDate>Tue, 16 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to optimize OpenShift logs collected with Red Hat OpenShift Logging Operator, as well as format and route them efficiently in Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (<a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a>) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.</p>
<h2>Why use OpenShift Logging Operator?</h2>
<p>A lot of enterprise customers use OpenShift as their orchestration solution. The advantages of this approach are:</p>
<ul>
<li>
<p>It is developed and supported by Red Hat</p>
</li>
<li>
<p>It can automatically update the OpenShift cluster along with the Operating system to make sure that they are and remain compatible</p>
</li>
<li>
<p>It can speed up development life cycles with features like Source-to-Image</p>
</li>
<li>
<p>It uses enhanced security</p>
</li>
</ul>
<p>In our consulting experience, this last aspect creates challenges and friction with OpenShift administrators when we try to install an Elastic Agent to collect the logs of the pods. Indeed, Elastic Agent requires the files of the host to be mounted in the pod, and it also needs to run in privileged mode. (Read more about the permissions required by Elastic Agent in the <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html#_red_hat_openshift_configuration">official Elasticsearch® Documentation</a>). While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.</p>
<h2>Which logs are we going to collect?</h2>
<p>In OpenShift Container Platform, we distinguish <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging.html#logging-architecture-overview_cluster-logging">three broad categories of logs</a>: audit, application, and infrastructure logs:</p>
<ul>
<li>
<p><strong>Audit logs</strong> describe the activities of users, administrators, and other components that affected the system.</p>
</li>
<li>
<p><strong>Application logs</strong> are composed of the container logs of the pods running in non-reserved namespaces.</p>
</li>
<li>
<p><strong>Infrastructure logs</strong> are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.</p>
</li>
</ul>
<p>For the sake of simplicity, we will consider only audit and application logs in this post. We will describe how to format them as expected by the Kubernetes integration to get the most out of Elastic Observability.</p>
<h2>Getting started</h2>
<p>To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.</p>
<h3>Inside Elasticsearch</h3>
<p>We first <a href="https://www.elastic.co/guide/en/fleet/8.11/install-uninstall-integration-assets.html#install-integration-assets">install the Kubernetes integration assets</a>. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs.</p>
<p>To format the logs received from the ClusterLogForwarder in <a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a> format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging-exported-fields.html">Exported fields | Logging | OpenShift Container Platform 4.14</a>. To get a list of exported fields of the Kubernetes integration, you can refer to <a href="https://www.elastic.co/guide/en/beats/filebeat/current/exported-fields-kubernetes-processor.html">Kubernetes fields | Filebeat Reference [8.11] | Elastic</a> and <a href="https://www.elastic.co/guide/en/observability/current/logs-app-fields.html">Logs app fields | Elastic Observability [8.11]</a>. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_ip&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.ip&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace_uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_id&quot;,
        &quot;target_field&quot;: &quot;container.id&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;container.id&quot;,
        &quot;pattern&quot;: &quot;%{container.runtime}://%{container.id}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_image&quot;,
        &quot;target_field&quot;: &quot;container.image.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.container.image&quot;,
        &quot;copy_from&quot;: &quot;container.image.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;kubernetes.container_name&quot;,
        &quot;field&quot;: &quot;container.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.container.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;level&quot;,
        &quot;target_field&quot;: &quot;log.level&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;file&quot;,
        &quot;target_field&quot;: &quot;log.file.path&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_owner&quot;,
        &quot;pattern&quot;: &quot;%{_tmp.parent_type}/%{_tmp.parent_name}&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;lowercase&quot;: {
        &quot;field&quot;: &quot;_tmp.parent_type&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod.{{_tmp.parent_type}}.name&quot;,
        &quot;value&quot;: &quot;{{_tmp.parent_name}}&quot;,
        &quot;if&quot;: &quot;ctx?._tmp?.parent_type != null&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;_tmp&quot;,
          &quot;kubernetes.pod_owner&quot;
        ],
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes annotations&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes namespace_labels&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.namespace_labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith(&quot;app_kubernetes_io_component_&quot;)) {
            def sanitizedKey = k.replace(&quot;app_kubernetes_io_component_&quot;, &quot;app_kubernetes_io_component/&quot;);
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    }
  ]
}
</code></pre>
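<p>Two of the less obvious steps in this pipeline are the dissect processor that splits the container ID and the scripts that replace dots in annotation and label keys. The Python sketch below illustrates the intended effect of both on sample values. The function names and the sample values are assumptions for illustration only. The pipeline itself runs as Painless and ingest processors inside Elasticsearch.</p>
<pre><code class="language-python">def normalize_keys(mapping):
    """Return a copy of mapping with dots in the keys replaced by underscores."""
    return {key.replace(".", "_"): value for key, value in mapping.items()}


def split_container_id(raw):
    """Split a 'runtime://id' string into (runtime, id).

    If the separator is missing, the value is returned unchanged with no
    runtime, similar to a dissect failure that is ignored.
    """
    runtime, separator, container_id = raw.partition("://")
    if separator == "":
        return None, raw
    return runtime, container_id


# Hypothetical sample values.
annotations = {"openshift.io/scc": "restricted-v2", "plain": "unchanged"}
print(normalize_keys(annotations))
print(split_container_id("cri-o://abc123"))
</code></pre>
<p>Flattening the keys this way avoids dots in annotation names being interpreted as nested object paths, which is the behavior the text above attributes to Elastic Agent.</p>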
<p>Similarly, to handle the audit logs like the ones collected by Kubernetes, we define an ingest pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-audit-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot;
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 &amp;&amp; !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=[&quot;audit&quot;:audit];
        &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Move all audit fields under the 'kubernetes.audit' object&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx?.kubernetes?.audit?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(&quot;.&quot;) &gt;= 0) {
              def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Normalize kubernetes audit annotations field as expected by the Integration&quot;
      }
    }
  ]
}
</code></pre>
<p>The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.</p>
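<p>As a rough illustration of what the first script processor does, the following Python sketch moves every top-level field into a kubernetes.audit object, except the fields the script keeps at the root and any field whose name starts with an underscore. The sample event and the function name nest_audit_fields are assumptions. The actual logic is the Painless script shown above.</p>
<pre><code class="language-python"># Fields the Painless script leaves at the root of the document.
KEEP_AT_ROOT = {"@timestamp", "data_stream", "openshift", "event", "hostname"}


def nest_audit_fields(doc):
    """Move all other top-level fields under doc['kubernetes']['audit']."""
    audit = {}
    for key in list(doc.keys()):
        if not key.startswith("_") and key not in KEEP_AT_ROOT:
            audit[key] = doc.pop(key)
    doc["kubernetes"] = {"audit": audit}
    return doc


# Hypothetical, heavily shortened audit event.
event = {
    "@timestamp": "2024-01-16T10:00:00Z",
    "hostname": "master-0",
    "openshift": {"cluster_id": "example-cluster"},
    "verb": "get",
    "requestURI": "/api/v1/namespaces",
}
print(nest_audit_fields(event))
</code></pre>
<p>After this step, fields such as verb and requestURI are found under kubernetes.audit, while hostname and openshift stay at the root for the processors that follow.</p>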
<p>We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in <a href="https://www.elastic.co/guide/en/fleet/8.11/data-streams-pipeline-tutorial.html#data-streams-pipeline-one">Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11]</a>.</p>
<p>The OpenShift Cluster Log Forwarder writes the data in the indices app-write and audit-write by default. It is possible to change this behavior, but it still tries to prepend the prefix “app” and the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the Reroute Processor in our blog <a href="https://www.elastic.co/blog/simplifying-log-data-management-flexible-routing-elastic">Simplifying log data management: Harness the power of flexible routing with Elastic</a> and our documentation <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">Reroute processor | Elasticsearch Guide [8.11] | Elastic</a>.</p>
<p>In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/app-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.container_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.container_logs-openshift&quot;
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-audit-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.audit_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.audit_logs-openshift&quot;
      }
    }
  ]
}
</code></pre>
<p>Because app-write and audit-write do not follow the data stream naming convention, we must set the destination field explicitly in the reroute processor. The reroute processor also fills in the <a href="https://www.elastic.co/guide/en/ecs/8.11/ecs-data_stream.html">data_stream fields</a> for us. With Elastic Agent, this step happens automatically at the source.</p>
<p>Next, we create the two indices and set the reroute pipelines we just defined as their default pipelines.</p>
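<p>To see why app-write and audit-write do not fit the convention, recall that a data stream name has the shape type-dataset-namespace. The following is a simplified C++ sketch of such a split, not the reroute processor's actual implementation; <code>DataStreamName</code> and <code>ParseDataStreamName</code> are hypothetical names introduced for illustration.</p>
<pre><code class="language-cpp">#include &lt;string&gt;

// Parts of a name that follows the type-dataset-namespace scheme.
struct DataStreamName
{
  std::string type;
  std::string dataset;
  std::string name_space;
  bool valid = false;
};

// Splits an index name on its first two hyphens. Names with fewer than
// two hyphens, or with an empty part, are reported as not valid.
DataStreamName ParseDataStreamName(const std::string &amp;index)
{
  DataStreamName result;
  const auto first = index.find('-');
  if (first == std::string::npos)
  {
    return result;  // no hyphen at all
  }
  const auto second = index.find('-', first + 1);
  if (second == std::string::npos)
  {
    return result;  // only one hyphen, e.g. "app-write"
  }
  result.type = index.substr(0, first);
  result.dataset = index.substr(first + 1, second - first - 1);
  result.name_space = index.substr(second + 1);
  result.valid = !result.type.empty() &amp;&amp; !result.dataset.empty() &amp;&amp;
                 !result.name_space.empty();
  return result;
}
</code></pre>
<p>With this sketch, logs-kubernetes.container_logs-openshift splits into the type logs, the dataset kubernetes.container_logs, and the namespace openshift, while app-write has only one hyphen and cannot be split, which is why the destination has to be spelled out.</p>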
<pre><code class="language-bash">PUT app-write
{
  &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;app-write-reroute-pipeline&quot;
   }
}


PUT audit-write
{
  &quot;settings&quot;: {
    &quot;index.default_pipeline&quot;: &quot;audit-write-reroute-pipeline&quot;
  }
}
</code></pre>
<p>Basically, what we did can be summarized in this picture:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog.png" alt="openshift-summary-blog" /></p>
<p>Let us take the container logs. When the operator attempts to write to the app-write index, Elasticsearch invokes the default pipeline “app-write-reroute-pipeline”, which converts the logs to ECS format and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which in turn invokes the logs-kubernetes.container_logs@custom pipeline if it exists. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another dataset and namespace based on the elastic.co/dataset and elastic.co/namespace annotations, as described in the Kubernetes <a href="https://docs.elastic.co/integrations/kubernetes/container-logs#rerouting-based-on-pod-annotations">integration documentation</a>, which in turn can lead to the execution of another integration pipeline.</p>
<h3>Create a user for sending the logs</h3>
<p>We are going to use basic authentication because, at the time of writing, it is the only supported authentication method for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to write to and read from the app-write and audit-write indices (required by the OpenShift agent), plus auto_configure access to logs-*-* to allow custom Kubernetes rerouting:</p>
<pre><code class="language-bash">PUT _security/role/YOURROLE
{
    &quot;cluster&quot;: [
      &quot;monitor&quot;
    ],
    &quot;indices&quot;: [
      {
        &quot;names&quot;: [
          &quot;logs-*-*&quot;
        ],
        &quot;privileges&quot;: [
          &quot;auto_configure&quot;,
          &quot;create_doc&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      },
      {
        &quot;names&quot;: [
          &quot;app-write&quot;,
          &quot;audit-write&quot;
        ],
        &quot;privileges&quot;: [
          &quot;create_doc&quot;,
          &quot;read&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      }
    ],
    &quot;applications&quot;: [],
    &quot;run_as&quot;: [],
    &quot;metadata&quot;: {},
    &quot;transient_metadata&quot;: {
      &quot;enabled&quot;: true
    }

}



PUT _security/user/YOUR_USERNAME
{
  &quot;password&quot;: &quot;YOUR_PASSWORD&quot;,
  &quot;roles&quot;: [&quot;YOURROLE&quot;]
}
</code></pre>
<h3>On OpenShift</h3>
<p>On the OpenShift Cluster, we need to follow the <a href="https://docs.openshift.com/container-platform/4.14/logging/log_collection_forwarding/log-forwarding.html">official documentation</a> of Red Hat on how to install the Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.</p>
<p>We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:</p>
<pre><code class="language-yaml">apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}
</code></pre>
<p>The Cluster Log Forwarder is the resource responsible for defining a daemon set that will forward the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials of the user we created previously. The secret must live in the same namespace where the ClusterLogForwarder will be deployed:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD
</code></pre>
<p>Finally, we create the ClusterLogForwarder resource:</p>
<pre><code class="language-yaml">kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: &quot;https://YOUR_ELASTICSEARCH_URL:443&quot;
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch
</code></pre>
<p>Note that we explicitly set the Elasticsearch version to 8; otherwise, the ClusterLogForwarder sends the _type field, which is not compatible with Elasticsearch 8. Note also that we collect only application and audit logs.</p>
<h2>Result</h2>
<p>Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are important differences, such as the lack of host and cloud metadata, which does not seem to be collected (at least not without additional configuration). We can view the Kubernetes container logs in the logs explorer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog-graphs.png" alt="openshift-summary-blog-graphs" /></p>
<p>In this post, we described how you can use the OpenShift Logging Operator to collect container logs and audit logs. We still recommend leveraging Elastic Agent to collect all your logs. It offers the best user experience: there is no need to maintain pipelines or transform the logs to ECS format yourself. Additionally, Elastic Agent uses API keys as the authentication method and collects metadata, such as cloud information, that allows you to do <a href="https://www.elastic.co/blog/optimize-cloud-resources-cost-apm-metadata-elastic-observability">more</a> in the long run.</p>
<p><a href="https://www.elastic.co/observability/log-monitoring">Learn more about log monitoring with the Elastic Stack</a>.</p>
<p><em>Have feedback on this blog?</em> <a href="https://github.com/herrBez/elastic-blog-openshift-logging/issues"><em>Share it here</em></a><em>.</em></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/139687_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Monitor your C++ Applications with Elastic APM]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-cpp-elastic</link>
            <guid isPermaLink="false">opentelemetry-cpp-elastic</guid>
            <pubDate>Tue, 11 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[In this article, we use the OpenTelemetry C++ client to monitor C++ applications with Elastic APM.]]></description>
            <content:encoded><![CDATA[<p>Monitor your C++ Applications with Elastic APM</p>
<h1>Introduction</h1>
<p>One of the main challenges that developers, SREs, and DevOps professionals face is the absence of a comprehensive tool that gives them visibility into their application stack. Many of the APM solutions on the market provide ways to monitor applications built on popular languages and frameworks (e.g., .NET, Java, Python) but fall short when it comes to C++ applications.</p>
<p>Luckily, Elastic has been one of the leading solutions in the observability space and a contributor to the OpenTelemetry project. Elastic’s unique position and its extensive observability capabilities allow end users to monitor applications built with object-oriented programming languages and frameworks in a variety of ways.</p>
<p>In this blog, we will explore using Elastic APM to investigate C++ traces with the OpenTelemetry client. We provide a step-by-step guide on how to set up the OpenTelemetry client for C++ applications and connect it to Elastic APM. While OTel has its own libraries, and this blog reviews how to use the OTel CPP library, Elastic also has its own Elastic Distributions of OpenTelemetry, which were developed to provide commercial support and whose changes are regularly contributed upstream.</p>
<p>Here are some resources to help get you started:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/current/apm-open-telemetry.html">Use OpenTelemetry with APM</a></p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-cpp">The OpenTelemetry C++ Client</a></p>
</li>
<li>
<p><a href="https://opentelemetry.io/docs/languages/cpp/">OpenTelemetry C++ Docs</a></p>
</li>
</ul>
<h1>Step by Step Guide</h1>
<h2>Prerequisites</h2>
<ul>
<li>
<h3>Environment</h3>
</li>
</ul>
<p>Choosing an environment is quite important as there is limited support for the OTEL client. We have experimented with using multiple Operating Systems and here are the suggestions:</p>
<ul>
<li>
<p>Ubuntu 22.04</p>
</li>
<li>
<p>Debian 11 Bullseye</p>
</li>
<li>
<p>For this guide we are focusing on Ubuntu 22.04.</p>
<ul>
<li>
<p>Machine: 2 vCPU, 4GB is sufficient.</p>
</li>
<li>
<p>Image: Ubuntu 22.04 LTS (x86_64).</p>
</li>
<li>
<p>Disk: ~30 GB is enough.</p>
</li>
</ul>
</li>
</ul>
<h2>Implementation method </h2>
<p>We experimented with multiple methods and found that the most suitable approach is to use a package manager. After extensive testing, it appears that building and running the otel-cpp client can be quite challenging for users. Building from source with tools such as CMake or Bazel is a viable option. However, as we tested both tools, it became obvious that we were spending most of our time and effort fixing OS compatibility and dependency issues instead of focusing on sending data to our APM. Hence, we decided to move to a different method.</p>
<p>The main issues that we kept running into during testing were:</p>
<ul>
<li>
<p>Compatibility of packages.</p>
</li>
<li>
<p>Availability of packages.</p>
</li>
<li>
<p>Dependencies of libraries and packages.</p>
</li>
</ul>
<p>In this guide we will use vcpkg, since it allows us to bring in all the dependencies required to run the OpenTelemetry C++ client.</p>
<h2>Installing required OS tools</h2>
<h3>Update package lists</h3>
<pre><code>    sudo apt-get update
</code></pre>
<p>Install build essentials, CMake, Git, and the SQLite development library:</p>
<pre><code>    sudo apt-get install -y build-essential cmake git curl zip unzip sqlite3 libsqlite3-dev
</code></pre>
<p>sqlite3 and libsqlite3-dev allow us to build/run SQLite queries in our C++ code.</p>
<h3>Set Up vcpkg</h3>
<p>vcpkg is the C++ package manager that we’ll use to install the opentelemetry-cpp client.</p>
<pre><code>    # Clone vcpkg
    cd ~
    git clone https://github.com/microsoft/vcpkg.git
</code></pre>
<pre><code>    # Bootstrap
    cd ~/vcpkg
    ./bootstrap-vcpkg.sh
</code></pre>
<h3>Install OpenTelemetry C++ with OTLP gRPC</h3>
<p>In this guide we focus on trace export to Elastic. At the time of writing, vcpkg’s opentelemetry-cpp version 1.18.0 fully supports traces but has limited direct metrics exporting.</p>
<h3>Install the package</h3>
<pre><code>    cd ~/vcpkg
    ./vcpkg install opentelemetry-cpp[otlp-grpc]:x64-linux
</code></pre>
<h4>Note</h4>
<p>Sometimes, installing opentelemetry-cpp on Linux does not install all the required packages. If you run into that case, a workaround is to run the install again with all features selected and the --allow-unsupported flag:</p>
<pre><code>    ./vcpkg install opentelemetry-cpp[*]:x64-linux --allow-unsupported
</code></pre>
<h3>Verify</h3>
<pre><code>    ./vcpkg list | grep opentelemetry-cpp
</code></pre>
<p>The output should look something like this:</p>
<pre><code>opentelemetry-cpp:x64-linux 1.18.0
</code></pre>
<h2>Create the C++ Project with Database Spans</h2>
<p>We’ll build a sample in ~/otel-app that:</p>
<ul>
<li>
<p>Uses SQLite to do basic CREATE/INSERT/SELECT queries. This is helpful to showcase capturing transactions for apps that use databases on Elastic APM.</p>
</li>
<li>
<p>Generate random traces to showcase how they are captured on Elastic APM.</p>
</li>
</ul>
<p>This app is going to generate random queries where some will contain database transactions and some are just application traces. Each query is contained in a child span, so they appear in APM as separate database transactions.</p>
<p>Below is the structure of our project:</p>
<pre><code>    otel-app/
    ├── main.cpp
    └── CMakeLists.txt
</code></pre>
<h3>Create App Project</h3>
<pre><code>    cd ~
    mkdir otel-app
    cd otel-app
</code></pre>
<p>Inside this project we will create two files:</p>
<ul>
<li>
<p>main.cpp</p>
</li>
<li>
<p>CMakeLists.txt</p>
</li>
</ul>
<p>Keep in mind that main.cpp is where you configure the OTel exporters that send data to the Elastic cluster. In your own tech stack, this would be your application's source code.</p>
<h4>Sample application code</h4>
<pre><code>    // main.cpp
    // Below we declare required libraries that we will be using to ship
    // traces to Elastic APM
    #include &lt;opentelemetry/exporters/otlp/otlp_grpc_exporter.h&gt;
    #include &lt;opentelemetry/sdk/trace/tracer_provider.h&gt;
    #include &lt;opentelemetry/sdk/trace/simple_processor.h&gt;
    #include &lt;opentelemetry/trace/provider.h&gt;

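    #include &lt;sqlite3.h&gt;
    #include &lt;chrono&gt;
    #include &lt;cstdlib&gt;  // for rand(), srand()
    #include &lt;ctime&gt;    // for time()
    #include &lt;iostream&gt;
    #include &lt;memory&gt;   // for std::make_unique, std::make_shared
    #include &lt;string&gt;   // for std::string, std::to_string
    #include &lt;thread&gt;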

    // Namespace aliases
    namespace trace_api = opentelemetry::trace;
    namespace sdktrace  = opentelemetry::sdk::trace;
    namespace otlp      = opentelemetry::exporter::otlp;

    // Below we are using a helper function to run SQLITE statement inside 
    // child span
    bool ExecuteSql(sqlite3 *db, const std::string &amp;sql,
                    trace_api::Tracer &amp;tracer,
                    const std::string &amp;span_name)
    {
      // Starting the child span
      auto db_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(db_span);

        // Here we mark Database attributes for clarity in APM
        db_span-&gt;SetAttribute(&quot;db.system&quot;, &quot;sqlite&quot;);
        db_span-&gt;SetAttribute(&quot;db.statement&quot;, sql);

        char *errMsg = nullptr;
        int rc = sqlite3_exec(db, sql.c_str(), nullptr, nullptr, &amp;errMsg);
        if (rc != SQLITE_OK)
        {
          db_span-&gt;AddEvent(&quot;SQLite error: &quot; + std::string(errMsg ? errMsg : &quot;unknown&quot;));
          sqlite3_free(errMsg);
          db_span-&gt;End();
          return false;
        }
        db_span-&gt;AddEvent(&quot;Query OK&quot;);
      }
      db_span-&gt;End();
      return true;
    }

    /**
     * DoNonDbWork - Simulate some other operation
     */
    void DoNonDbWork(trace_api::Tracer &amp;tracer, const std::string &amp;span_name)
    {
      auto child_span = tracer.StartSpan(span_name);
      {
        auto scope = tracer.WithActiveSpan(child_span);
        // Just sleep or do some &quot;fake&quot; work
        std::cout &lt;&lt; &quot;[TRACE] Doing non-DB work for &quot; &lt;&lt; span_name &lt;&lt; &quot;...\n&quot;;
        std::this_thread::sleep_for(std::chrono::milliseconds(200 + rand() % 300));
        child_span-&gt;AddEvent(&quot;Finished non-DB work&quot;);
      }
      child_span-&gt;End();
    }

    int main()
    {
      // Seed random generator for example
      srand(static_cast&lt;unsigned&gt;(time(nullptr)));

      // 1) Create OTLP exporter for traces
      otlp::OtlpGrpcExporterOptions opts;
      auto exporter = std::make_unique&lt;otlp::OtlpGrpcExporter&gt;(opts);

      // 2) Simple Span Processor
      auto processor = std::make_unique&lt;sdktrace::SimpleSpanProcessor&gt;(std::move(exporter));

      // 3) Tracer Provider
      auto sdk_tracer_provider = std::make_shared&lt;sdktrace::TracerProvider&gt;(std::move(processor));
      auto tracer = sdk_tracer_provider-&gt;GetTracer(&quot;my-cpp-multi-app&quot;);

      // Prepare an in-memory SQLite DB (for random DB usage)
      sqlite3 *db = nullptr;
      int rc = sqlite3_open(&quot;:memory:&quot;, &amp;db);
      if (rc == SQLITE_OK)
      {
        // Create a table so we can do inserts/reads
        ExecuteSql(db, &quot;CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, info TEXT);&quot;,
                   *tracer.get(), &quot;db_create_table&quot;);
      }

      // Create the following loop to generate multiple transactions
      int num_transactions = 5;  // Change this variable to the desired number of transaction
      for (int i = 1; i &lt;= num_transactions; i++)
      {
        // Each iteration is a top-level transaction
        std::string transaction_name = &quot;transaction_&quot; + std::to_string(i);
        auto parent_span = tracer-&gt;StartSpan(transaction_name);
        {
          auto scope = tracer-&gt;WithActiveSpan(parent_span);

          std::cout &lt;&lt; &quot;\n=== Starting &quot; &lt;&lt; transaction_name &lt;&lt; &quot; ===\n&quot;;

          // Randomly select whether a transaction will interact with the database or not.
          bool doDb = (rand() % 2 == 0); // 50% chance

          if (doDb &amp;&amp; db)
          {
            // Insert random data
            std::string insert_sql = &quot;INSERT INTO items (info) VALUES ('Item &quot; + std::to_string(i) + &quot;');&quot;;
            ExecuteSql(db, insert_sql, *tracer.get(), &quot;db_insert_item&quot;);

            // Select from DB
            ExecuteSql(db, &quot;SELECT * FROM items;&quot;, *tracer.get(), &quot;db_select_items&quot;);
          }
          else
          {
            // Do some random non-DB tasks
            DoNonDbWork(*tracer.get(), &quot;non_db_task_1&quot;);
            DoNonDbWork(*tracer.get(), &quot;non_db_task_2&quot;);
          }

          // Sleep a little to simulate transaction time
          std::this_thread::sleep_for(std::chrono::milliseconds(200));
        }
        parent_span-&gt;End();
      }

      // Close DB
      sqlite3_close(db);

      // Extra sleep to ensure final flush
      std::cout &lt;&lt; &quot;\n[INFO] Sleeping 5 seconds to allow flush...\n&quot;;
      std::this_thread::sleep_for(std::chrono::seconds(5));
      std::cout &lt;&lt; &quot;[INFO] Exiting.\n&quot;;
      return 0;
    }
</code></pre>
<h5>What does the code do?</h5>
<p>We create 5 top-level “transaction_i” spans.</p>
<p>For each transaction, we randomly choose to do DB or non-DB work</p>
<pre><code>- If DB: Insert a row, then select. Each is a child span.

- If non-DB: We do two “fake tasks” (child spans).
</code></pre>
<p>Once we finish, we close the database connection and wait 5 seconds for data flush.</p>
<h4>Sample instruction file</h4>
<p>CMakeLists.txt : This file contains instructions describing the source files and targets.</p>
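<pre><code>    cmake_minimum_required(VERSION 3.10)

    # Point CMake at the vcpkg toolchain. This only takes effect when set
    # before project(), or when passed with -DCMAKE_TOOLCHAIN_FILE=... instead.
    set(CMAKE_TOOLCHAIN_FILE &quot;PATH-TO/vcpkg.cmake&quot; CACHE STRING &quot;Vcpkg toolchain file&quot;)

    project(OtelApp VERSION 1.0)

    # main.cpp uses std::make_unique, which requires C++14 or newer
    set(CMAKE_CXX_STANDARD 14)
    set(CMAKE_CXX_STANDARD_REQUIRED ON)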

    find_package(opentelemetry-cpp CONFIG REQUIRED)

    add_executable(otel_app main.cpp)

    # Below we are linking the OTLP gRPC exporter, trace library, and sqlite3
    target_link_libraries(otel_app PRIVATE
        opentelemetry-cpp::otlp_grpc_exporter
        opentelemetry-cpp::trace
        sqlite3
    )
</code></pre>
<h4>Declare Environment Variables</h4>
<p>Here we are going to export our Elastic Cloud endpoints as environment variables.</p>
<p>You can get that information by doing the following:</p>
<ol>
<li>
<p>Log in to your Elastic Cloud account</p>
</li>
<li>
<p>Go into your deployment</p>
</li>
<li>
<p>On the left-hand side, click on the hamburger menu and scroll down to “Integrations”</p>
</li>
<li>
<p>In the search bar on the Integrations page, type “APM”</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/APM-Search.png" alt="" /></p>
<ol start="5">
<li>
<p>Click on the APM integration</p>
</li>
<li>
<p>Scroll down and click on the OpenTelemetry Option on the far left side</p>
</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/highlighted.png" alt="" /></p>
<ol start="7">
<li>You should be able to see values similar to the screenshot below. Once you copy the values to export, click on launch APM.</li>
</ol>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/highlighted2.png" alt="" /></p>
<p>As you copy the required values, go ahead and export them.</p>
<pre><code>    export OTEL_EXPORTER_OTLP_ENDPOINT=&quot;APM-ENDPOINT&quot;
    export OTEL_EXPORTER_OTLP_HEADERS=&quot;KEY&quot;
    export OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=my-app,service.version=1.0.0,deployment.environment=dev&quot;
</code></pre>
<p>Note that the OTEL_EXPORTER_OTLP_HEADERS value provided by Elastic usually starts with “Authorization=Bearer”. Make sure that you change the uppercase “A” in “Authorization” to a lowercase “a”, because the OTel gRPC exporter expects the header key to be all lowercase.</p>
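<p>The header value is a comma-separated list of key=value pairs. If you would rather normalize the keys in code than edit the value by hand, a minimal sketch could look like the following. <code>ParseOtlpHeaders</code> is a hypothetical helper written for this post, not part of opentelemetry-cpp (the exporter reads the environment variable on its own), and it assumes that no value contains a comma.</p>
<pre><code class="language-cpp">#include &lt;algorithm&gt;
#include &lt;cctype&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;
#include &lt;utility&gt;
#include &lt;vector&gt;

using Header = std::pair&lt;std::string, std::string&gt;;

// Parses &quot;key1=value1,key2=value2&quot; and lowercases each key.
// Values are left untouched so that tokens keep their exact casing.
std::vector&lt;Header&gt; ParseOtlpHeaders(const std::string &amp;raw)
{
  std::vector&lt;Header&gt; headers;
  std::istringstream stream(raw);
  std::string item;
  while (std::getline(stream, item, ','))
  {
    // Split on the first '=' only, so values may contain '=' themselves.
    const auto eq = item.find('=');
    if (eq == std::string::npos || eq == 0)
    {
      continue;  // skip entries without '=' or with an empty key
    }
    std::string key = item.substr(0, eq);
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast&lt;char&gt;(std::tolower(c)); });
    headers.emplace_back(key, item.substr(eq + 1));
  }
  return headers;
}
</code></pre>
<p>Given “Authorization=Bearer abc123”, the sketch yields the key “authorization” with the value “Bearer abc123” unchanged. It does not trim whitespace around keys, so keep the exported value free of stray spaces.</p>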
<h3>Build and Run</h3>
<p>Once we create the two files we then move to building the application.</p>
<pre><code>cd ~/otel-app
mkdir -p build
cd build

cmake -DCMAKE_TOOLCHAIN_FILE=~/vcpkg/scripts/buildsystems/vcpkg.cmake \
      -DCMAKE_PREFIX_PATH=~/vcpkg/installed/x64-linux/share \
      ..
make
</code></pre>
<p>Once make succeeds, run the application. The executable is named otel_app, as declared by add_executable in CMakeLists.txt:</p>
<pre><code>./otel_app
</code></pre>
<p>You should see the application run with console output similar to the following:</p>
<pre><code>    Console outcome:
    === Starting transaction_1 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_2 ===
    [TRACE] Doing DB work for doDb_task_1...
    [TRACE] Doing DB work for doDb_task_2...

    === Starting transaction_3 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_4 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    === Starting transaction_5 ===
    [TRACE] Doing non-DB work for non_db_task_1...
    [TRACE] Doing non-DB work for non_db_task_2...

    [INFO] Sleeping 5 seconds to allow flush...
    [INFO] Exiting.
</code></pre>
<p>Once the script executes you should be able to observe those traces on Elastic APM similar to the screenshots below.</p>
<h3>Observe in Elastic APM</h3>
<p>Go to Elastic Cloud, open your deployment, and navigate to Observability &gt; APM.</p>
<p>Look for the app name in the service list (as defined by OTEL_RESOURCE_ATTRIBUTES).</p>
<p>Inside that service’s Traces tab, you’ll find multiple transactions like “transaction_1”, “transaction_2”, etc.</p>
<p>Expanding each transaction shows child spans:</p>
<pre><code>- Possibly db_insert_item and db_select_items if random DB path was taken.

- Otherwise, non_db_task_1 and non_db_task_2.
</code></pre>
<p>You can see how some transactions do DB calls and some do not, each with different spans. This variety demonstrates how your real application might produce multiple different “routes” or “operations.”</p>
<h4>Service Map</h4>
<p>If everything runs correctly, you should be able to view your services and see service maps for your application.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Service-Map.png" alt="" /></p>
<h4>Services</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Services.png" alt="" /></p>
<h4>My Elastic App</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Overview-transactions.png" alt="" /></p>
<h4>App Transactions</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Transactions2.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Trace-db.png" alt="" /></p>
<h4>Dependencies</h4>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Dependecies.png" alt="" /></p>
<h4>Logs</h4>
<p>Navigate to your logs window/Discover to see the incoming application logs</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Logs.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/Logs2.png" alt="" /></p>
<h4>Patterns</h4>
<p>Log pattern analysis helps you to find patterns in unstructured log messages and makes it easier to examine your data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/patt2.png" alt="" /></p>
<h2>Final Recap</h2>
<p>Here is a quick summary of what we did:</p>
<ul>
<li>
<p>Provisioned an Ubuntu 22.04 machine.</p>
</li>
<li>
<p>Installed build tools, the SQLite development libraries, and vcpkg.</p>
</li>
<li>
<p>Installed the client for opentelemetry-cpp via vcpkg.</p>
</li>
<li>
<p>Created a minimal C++ project that executes app traces and captures database operations.</p>
</li>
<li>
<p>Linked the sqlite3 library in CMakeLists.txt.</p>
</li>
<li>
<p>Exported the Elastic OTLP endpoint &amp; token as environment variables (with a lowercase authorization=Bearer key!).</p>
</li>
<li>
<p>Ran the application and observed DB interactions and app traces in Elastic APM.</p>
</li>
<li>
<p>Observed application logs and patterns on Elastic logs and Discover.</p>
</li>
</ul>
<h2>FAQ &amp; Common Issues</h2>
<ul>
<li>Getting “Could not find package configuration file provided by opentelemetry-cpp”?</li>
</ul>
<p>Make sure you pass <code>-DCMAKE_TOOLCHAIN_FILE=...</code> and <code>-DCMAKE_PREFIX_PATH=...</code> to cmake, or embed them in CMakeLists.txt.</p>
<ul>
<li>Crash: “validate_metadata: INTERNAL:Illegal header key”?</li>
</ul>
<p>Use all-lowercase keys in <code>OTEL_EXPORTER_OTLP_HEADERS</code>, e.g. <code>authorization=Bearer &lt;token&gt;</code>.</p>
<ul>
<li>Missing otlp_grpc_metrics_exporter.h?</li>
</ul>
<p>Your vcpkg version of opentelemetry-cpp (1.18.0) lacks a direct metrics exporter for OTLP. For metrics, either upgrade the library or consider an OpenTelemetry Collector approach.</p>
<ul>
<li>No data in Elastic APM?</li>
</ul>
<p>Double-check your endpoint URL, Bearer token, firewall rules, or service name in the APM</p>
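<p>A common cause of missing data is simply that one of the variables exported earlier is not set in the shell that launches the binary. As a small, hedged sketch, the check below could run at the top of main() before the exporter is created. <code>MissingOtelEnvVars</code> is a hypothetical helper name, and the list of names is just the three variables used in this guide.</p>
<pre><code class="language-cpp">#include &lt;cstdlib&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Returns the names of the variables from this guide that are unset or empty.
std::vector&lt;std::string&gt; MissingOtelEnvVars()
{
  const char *required[] = {
      &quot;OTEL_EXPORTER_OTLP_ENDPOINT&quot;,
      &quot;OTEL_EXPORTER_OTLP_HEADERS&quot;,
      &quot;OTEL_RESOURCE_ATTRIBUTES&quot;,
  };
  std::vector&lt;std::string&gt; missing;
  for (const char *name : required)
  {
    const char *value = std::getenv(name);
    if (value == nullptr || *value == '\0')
    {
      missing.emplace_back(name);
    }
  }
  return missing;
}
</code></pre>
<p>If the returned list is not empty, print the names and exit early instead of waiting for traces that will never arrive.</p>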
<h2>Additional Resources:</h2>
<ul>
<li><a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud free trial</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/tag/opentelemetry">More Elastic OpenTelemetry Topics</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Introducing Elastic Distributions of OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">Introducing Elastic Distribution of OpenTelemetry Collector</a></li>
<li><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">Instrumenting your OpenAI-powered Python, Node.js, and Java Applications with EDOT</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-cpp-elastic/blog-image.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Optimizing Observability with ES|QL: Streamlining SRE operations and issue resolution for Kubernetes and OTel]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-kubernetes-esql</link>
            <guid isPermaLink="false">opentelemetry-kubernetes-esql</guid>
            <pubDate>Wed, 01 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[ES|QL enhances operational efficiency, data analysis, and issue resolution for SREs. This blog covers the advantages of ES|QL in Elastic Observability and how it can apply to managing issues instrumented with OpenTelemetry and running on Kubernetes.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT Operations, DevOps), managing technology and data sprawl is an ongoing challenge. Simply managing the large volumes of high dimensionality and high cardinality data is overwhelming.</p>
<p>As a single platform, Elastic® helps SREs unify and correlate limitless telemetry data, including metrics, logs, traces, and profiling, into a single datastore — Elasticsearch®. By then applying the power of Elastic’s advanced machine learning (ML), AIOps, AI Assistant, and analytics, you can break down silos and turn data into insights. As a full-stack observability solution, everything from infrastructure monitoring to log monitoring and application performance monitoring (APM) can be found in a single, unified experience.</p>
<p>In Elastic 8.11, a technical preview is now available of <a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">Elastic’s new piped query language, ES|QL (Elasticsearch Query Language)</a>, which transforms, enriches, and simplifies data investigations. Powered by a new query engine, ES|QL delivers advanced search capabilities with concurrent processing, improving speed and efficiency, irrespective of data source and structure. Accelerate resolution by creating aggregations and visualizations from one screen, delivering an iterative, uninterrupted workflow.</p>
<h2>Advantages of ES|QL for SREs</h2>
<p>SREs using Elastic Observability can leverage ES|QL to analyze logs, metrics, traces, and profiling data, enabling them to pinpoint performance bottlenecks and system issues with a single query. SREs gain the following advantages when managing high dimensionality and high cardinality data with ES|QL in Elastic Observability:</p>
<ul>
<li><strong>Improved operational efficiency:</strong> By using ES|QL, SREs can create more actionable notifications with aggregated values as thresholds from a single query, which can also be managed through the Elastic API and integrated into DevOps processes.</li>
<li><strong>Enhanced analysis with insights:</strong> ES|QL can process diverse observability data, including application, infrastructure, business data, and more, regardless of the source and structure. ES|QL can easily enrich the data with additional fields and context, allowing the creation of visualizations for dashboards or issue analysis with a single query.</li>
<li><strong>Reduced mean time to resolution:</strong> ES|QL, when combined with Elastic Observability's AIOps and AI Assistant, enhances detection accuracy by identifying trends, isolating incidents, and reducing false positives. This improvement in context facilitates troubleshooting and the quick pinpointing and resolution of issues.</li>
</ul>
<p>ES|QL in Elastic Observability not only enhances an SRE's ability to manage the customer experience, an organization's revenue, and SLOs more effectively but also facilitates collaboration with developers and DevOps by providing contextualized aggregated data.</p>
<p>In this blog, we will cover some of the key use cases SREs can leverage with ES|QL:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses a public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>I will work through these use cases by showcasing how an SRE can solve a problem in an application instrumented with OpenTelemetry and running on Kubernetes. The OpenTelemetry (OTel) demo is on an Amazon EKS cluster, with Elastic Cloud 8.11 configured.</p>
<p>You can also check out our <a href="https://www.youtube.com/watch?v=vm0pBWI2l9c">Elastic Observability ES|QL Demo</a>, which walks through ES|QL functionality for Observability.</p>
<h2>ES|QL with AI Assistant</h2>
<p>As an SRE, you are monitoring your OTel instrumented application with Elastic Observability, and while in Elastic APM, you notice some issues highlighted in the service map.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-1-services.png" alt="1 - services" /></p>
<p>Using the Elastic AI Assistant, we can simply ask for an analysis. In particular, we check the overall latency across the application services.</p>
<pre><code class="language-plaintext">My APM data is in traces-apm*. What's the average latency per service over the last hour? Use ESQL, the data is mapped to ECS
</code></pre>
<p>The Elastic AI Assistant generates an ES|QL query, which we run in the AI Assistant to get a list of the average latencies across all the application services. We can easily see the top four are:</p>
<ul>
<li>load generator</li>
<li>front-end proxy</li>
<li>frontendservice</li>
<li>checkoutservice</li>
</ul>
<p>From a simple natural language prompt, the AI Assistant generated a single ES|QL query that lists the latencies across the services.</p>
<p>Noticing that there is an issue with several services, we decide to start with the frontend proxy. As we work through the details, we see significant failures, and through <strong>Elastic APM failure correlation</strong>, it becomes apparent that the frontend proxy is not properly completing its calls to downstream services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-2-failed-transaction.png" alt="2 - failed transaction" /></p>
<h2>ES|QL insightful and contextual analysis in Discover</h2>
<p>Knowing that the application is running on Kubernetes, we investigate if there are issues in Kubernetes. In particular, we want to see if there are any services having issues.</p>
<p>We use the following query in ES|QL in Elastic Discover:</p>
<pre><code class="language-sql">FROM metrics-*
| WHERE kubernetes.container.status.last_terminated_reason != &quot;&quot; AND kubernetes.namespace == &quot;default&quot;
| STATS reason_count = COUNT(kubernetes.container.status.last_terminated_reason) BY kubernetes.container.name, kubernetes.container.status.last_terminated_reason
| WHERE reason_count &gt; 0
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-3-two-horizontal-bar-graphs.png" alt="3 - horizontal graph" /></p>
<p>ES|QL helps analyze thousands to tens of thousands of metric events from Kubernetes and highlights two services that are restarting due to OOMKilled.</p>
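<p>To make the aggregation concrete, here is a minimal Python sketch of what the <code>STATS ... BY</code> step computes, applied to a handful of in-memory events instead of Elasticsearch documents. The flat keys <code>namespace</code>, <code>container</code>, and <code>reason</code> are simplified stand-ins for the <code>kubernetes.*</code> fields, and the events themselves are made up for illustration.</p>
<pre><code class="language-python">from collections import Counter


def count_termination_reasons(events, namespace='default'):
    '''Count non-empty last-terminated reasons per (container, reason) pair.'''
    counts = Counter()
    for event in events:
        reason = event.get('reason', '')
        # Same filter as the WHERE clause: a reason is present and the namespace matches
        if event.get('namespace') == namespace and reason != '':
            counts[(event['container'], reason)] += 1
    return counts


if __name__ == '__main__':
    # Hypothetical events, for illustration only
    events = [
        {'namespace': 'default', 'container': 'emailservice', 'reason': 'OOMKilled'},
        {'namespace': 'default', 'container': 'emailservice', 'reason': 'OOMKilled'},
        {'namespace': 'default', 'container': 'cartservice', 'reason': ''},
        {'namespace': 'kube-system', 'container': 'coredns', 'reason': 'Error'},
    ]
    for (container, reason), count in count_termination_reasons(events).items():
        print(container, reason, count)
</code></pre>
<p>Only containers in the chosen namespace with a recorded termination reason end up in the result, which is why the final <code>WHERE reason_count &gt; 0</code> in the query acts as a safety net rather than the main filter.</p>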
<p>The Elastic AI Assistant, when asked about OOMKilled, indicates that a container in a pod was killed due to an out-of-memory condition.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-4-understanding-oomkilled.png" alt="4 - understanding oomkilled" /></p>
<p>We run another ES|QL query to understand the memory usage for emailservice and productcatalogservice.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-5-split-bar-graphs.png" alt="5 - split bar graphs" /></p>
<p>The ES|QL query shows that the average memory usage for both services is fairly high.</p>
<p>We can now further investigate both of these services’ logs, metrics, and Kubernetes-related data. However, before we continue, we create an alert to track heavy memory usage.</p>
<h2>Actionable alerts with ES|QL</h2>
<p>Suspecting a specific issue that might recur, we create an alert from the ES|QL query we just ran. The alert will track any service that exceeds 50% memory utilization.</p>
<p>We modify the last query to find any service with high memory usage:</p>
<pre><code class="language-sql">FROM metrics*
| WHERE @timestamp &gt;= NOW() - 1 hours
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.deployment.name
| WHERE avg_memory_usage &gt; 0.5
</code></pre>
<p>With that query, we create a simple alert. Notice how the ES|QL query is brought into the alert. We connect this alert to PagerDuty, but we could choose from multiple other connectors such as ServiceNow, Opsgenie, and email.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-6-create-rule.png" alt="6 - create rule" /></p>
<p>With this alert, we can now easily monitor for any services that exceed 50% memory utilization in their pods.</p>
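<p>The threshold logic behind the alert is simple enough to sketch in a few lines of Python: average the memory usage per deployment, then keep the deployments whose average exceeds 0.5. The function name and the sample readings are hypothetical, and the alert itself still runs the ES|QL query inside Elastic, not this code.</p>
<pre><code class="language-python">from collections import defaultdict
from statistics import mean


def deployments_over_threshold(samples, threshold=0.5):
    '''Average memory usage per deployment and keep those above threshold.'''
    by_deployment = defaultdict(list)
    for deployment, usage_pct in samples:
        by_deployment[deployment].append(usage_pct)
    averages = {name: mean(values) for name, values in by_deployment.items()}
    # Equivalent of the trailing WHERE clause on the aggregated value
    return {name: avg for name, avg in averages.items() if avg > threshold}


if __name__ == '__main__':
    # Hypothetical (deployment, memory usage as a fraction of the limit) readings
    samples = [
        ('emailservice', 0.75),
        ('emailservice', 0.5),
        ('cartservice', 0.25),
    ]
    print(deployments_over_threshold(samples))
</code></pre>
<p>Because the comparison happens after averaging, a single spike above 50% does not trigger the condition unless it pulls the hourly average over the threshold.</p>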
<h2>Make the most of your data with ES|QL</h2>
<p>In this post, we demonstrated the power ES|QL brings to analysis, operations, and reducing MTTR. In summary, the three use cases with ES|QL in Elastic Observability are as follows:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses a public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>Elastic invites SREs and developers to experience this transformative language firsthand and unlock new horizons in their data tasks. Try ES|QL today, now in technical preview, at <a href="https://ela.st/free-trial">https://ela.st/free-trial</a>.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/demo-gallery/observability">Elastic Observability Tour</a></li>
<li><a href="https://www.elastic.co/blog/log-management-observability-operations">The power of effective log management</a></li>
<li><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Transforming Observability with the AI Assistant</a></li>
<li><a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">ES|QL announcement blog</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/ES_QL_blog-720x420-05.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 1]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-1</guid>
            <pubDate>Wed, 25 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect and assess PII in your logs using Elasticsearch and NLP]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This two-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and pattern matching to detect, assess, and, where feasible, redact PII from logs that are being ingested into Elasticsearch.</p>
<p>In <strong>Part 1</strong> of this blog, we will cover the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and the redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Here is the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h2>Tools and Techniques</h2>
<p>There are four general capabilities that we will use for this exercise.</p>
<ul>
<li>Named Entity Recognition (NER) Detection</li>
<li>Pattern Matching Detection</li>
<li>Log Sampling</li>
<li>Ingest Pipelines as Composable Processing</li>
</ul>
<h4>Named Entity Recognition (NER) Detection</h4>
<p>NER is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing named entities in unstructured text into predefined categories such as:</p>
<ul>
<li>Person: Names of individuals, including celebrities, politicians, and historical figures.</li>
<li>Organization: Names of companies, institutions, and organizations.</li>
<li>Location: Geographic locations, including cities, countries, and landmarks.</li>
<li>Event: Names of events, including conferences, meetings, and festivals.</li>
</ul>
<p>For our PII use case, we will choose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a>, which can be downloaded from <a href="https://huggingface.co">Hugging Face</a> and loaded into Elasticsearch as a trained model.</p>
<p><strong>Important Note:</strong> NER / NLP models are CPU-intensive and expensive to run at scale; thus, we will want to employ a sampling technique to understand the risk in our logs without sending the full log volume through the NER model. We will discuss the performance and scaling of the NER model in Part 2 of the blog.</p>
<h4>Pattern Matching Detection</h4>
<p>In addition to NER, regex pattern matching is a powerful tool for detecting and redacting PII based on common patterns. The Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html">redact</a> processor is built for this use case.</p>
<h4>Log Sampling</h4>
<p>Considering the performance implications of NER and the fact that we may be ingesting a large volume of logs into Elasticsearch, it makes sense to sample our incoming logs. We will build a simple log sampler to accomplish this.</p>
<h4>Ingest Pipelines as Composable Processing</h4>
<p>We will create several pipelines, each focusing on a specific capability and a main ingest pipeline to orchestrate the overall process.</p>
<h2>Building the Processing Flow</h2>
<h4>Logs Sampling + Composable Ingest Pipelines</h4>
<p>The first thing we will do is set up a sampler to sample our logs. This ingest pipeline takes a sampling rate between 0 (no logs) and 10000 (all logs), which allows a sampling rate as low as ~0.01%, and marks the sampled logs with <code>sample.sampled: true</code>. Further processing on the logs will be driven by the value of <code>sample.sampled</code>. The <code>sample.sample_rate</code> can be set here or &quot;passed in&quot; from the orchestration pipeline.</p>
<p>Run the following commands from Kibana -&gt; Dev Tools.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-1.json">The code can be found here</a> for the following three sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;logs-sampler pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs-sampler pipeline - part 1
DELETE _ingest/pipeline/logs-sampler
PUT _ingest/pipeline/logs-sampler
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;if&quot;: &quot;ctx.sample?.sample_rate == null&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 10000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Determine if keeping unsampled docs&quot;,
        &quot;if&quot;: &quot;ctx.sample?.keep_unsampled == null&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot; Random r = new Random();
        ctx.sample.random = r.nextInt(params.max); &quot;&quot;&quot;,
        &quot;params&quot;: {
          &quot;max&quot;: 10000
        }
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx.sample.random &lt; ctx.sample.sample_rate&quot;,
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;drop&quot;: {
        &quot;description&quot;: &quot;Drop unsampled document if applicable&quot;,
        &quot;if&quot;: &quot;ctx.sample.keep_unsampled == false &amp;&amp; ctx.sample.sampled == false&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
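<p>To build intuition for how the sampling rate maps to a percentage, here is a minimal Python sketch of the same idea, run outside Elasticsearch. The names <code>is_sampled</code> and <code>sampled_fraction</code> and the fixed seed are illustrative assumptions, not part of the pipeline. The sketch treats a document as sampled when the random draw is strictly below the rate, so that a rate of 0 keeps nothing and a rate of 10000 keeps everything.</p>
<pre><code class="language-python">import random

MAX_RATE = 10000  # same upper bound as params.max in the pipeline


def is_sampled(sample_rate, rng):
    '''Return True when a random draw in [0, MAX_RATE) falls below sample_rate.'''
    draw = rng.randrange(MAX_RATE)
    # draw in range(n) is True for exactly n of the MAX_RATE possible draws
    return draw in range(sample_rate)


def sampled_fraction(sample_rate, total_docs, seed=42):
    '''Simulate total_docs documents and return the fraction marked as sampled.'''
    rng = random.Random(seed)
    hits = sum(is_sampled(sample_rate, rng) for _ in range(total_docs))
    return hits / total_docs


if __name__ == '__main__':
    for rate in (0, 1, 1000, 10000):
        expected = rate / MAX_RATE
        observed = sampled_fraction(rate, 100000)
        print(f'rate={rate:5d} expected={expected:.4f} observed={observed:.4f}')
</code></pre>
<p>The expected fraction is simply <code>sample_rate / 10000</code>, so a rate of 1 corresponds to 0.01% and a rate of 1000 to 10%. The observed fraction will hover around that value rather than match it exactly, which is also what you should expect to see in your data.</p>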
<p>Now, let's test the logs sampler. We will build the first part of the composable pipeline. We will be sending logs to the logs-generic-default data stream. With that in mind, we will create the <code>logs@custom</code> ingest pipeline that will be automatically called using the logs <a href="https://www.elastic.co/guide/en/fleet/current/data-streams.html#data-streams-pipelines">data stream framework</a> for customization. We will add one additional level of abstraction so that you can apply this PII processing to other data streams.</p>
<p>Next, we will create the <code>process-pii</code> pipeline. This is the core processing pipeline where we will orchestrate PII processing component pipelines. In this first step, we will simply apply the sampling logic. Note that we are setting the sampling rate to 1000, which is equivalent to 10% of the logs (1000 / 10000).</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Process PII pipeline - part 1
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Finally, we create the <code>logs@custom</code> pipeline, which simply calls our <code>process-pii</code> pipeline when <code>data_stream.dataset</code> matches.</p>
&lt;details open&gt;
  &lt;summary&gt;logs@custom pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs@custom pipeline - part 1
DELETE _ingest/pipeline/logs@custom
PUT _ingest/pipeline/logs@custom
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevel&quot;,
        &quot;value&quot;: &quot;logs@custom&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevelinfo&quot;,
        &quot;value&quot;: &quot;{{{data_stream.dataset}}}&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;description&quot; : &quot;Call the process-pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;,
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
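<p>Before loading any data, you can sanity-check the routing with the ingest simulate API (<code>POST _ingest/pipeline/logs@custom/_simulate</code>). The Python sketch below only builds the request body. The helper name <code>build_simulate_body</code> and the sample messages are hypothetical, and sending the request to your cluster is left to Dev Tools or your HTTP client of choice.</p>
<pre><code class="language-python">import json


def build_simulate_body(messages, dataset='pii'):
    '''Build a body for POST _ingest/pipeline/logs@custom/_simulate.

    Each message becomes one test document whose data_stream.dataset
    decides whether logs@custom hands it to process-pii.
    '''
    docs = []
    for message in messages:
        docs.append({
            '_source': {
                'message': message,
                'data_stream': {'dataset': dataset},
            }
        })
    return {'docs': docs}


if __name__ == '__main__':
    # Hypothetical log lines, for illustration only
    body = build_simulate_body(['user login ok', 'payment accepted'])
    print(json.dumps(body, indent=2))
</code></pre>
<p>In the simulated output, documents with the <code>pii</code> dataset should come back with the <code>sample.*</code> fields populated, while documents with any other dataset should only carry the two <code>pipelinetoplevel</code> fields.</p>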
<p>Now, let's test to see the sampling at work.</p>
<p>Load the data as described in the <a href="#data-loading-appendix">Data Loading Appendix</a>. Let's use the sample data first; we will talk about how to test with your incoming or historical logs at the end of this blog.</p>
<p>If you look at Observability -&gt; Logs -&gt; Logs Explorer with the KQL filter <code>data_stream.dataset : pii</code> and break down by <code>sample.sampled</code>, you should see that approximately 10% of the logs are sampled.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-1-part-1.png" alt="PII Discover 1" /></p>
<p>At this point we have a composable ingest pipeline that is &quot;sampling&quot; logs. As a bonus, you can use this logs sampler for any other use cases you have as well.</p>
<h4>Loading, Configuration, and Execution of the NER Pipeline</h4>
<h5>Loading the NER Model</h5>
<p>You will need a Machine Learning node to run the NER model on. In this exercise, we are using an <a href="https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html">Elastic Cloud Hosted Deployment</a> on AWS with the <a href="https://www.elastic.co/guide/en/cloud/current/ec_selecting_the_right_configuration_for_you.html">CPU Optimized (ARM)</a> architecture. The NER inference will run on a Machine Learning AWS c5d node. There will be GPU options in the future, but today, we will stick with CPU architecture.</p>
<p>This exercise will use a single c5d node with 8 GB RAM and 4.2 vCPU, up to 8.4 vCPU.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ml-node-part-1.png" alt="ML Node" /></p>
<p>Please refer to the official documentation on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-import-model.html">how to import an NLP-trained model into Elasticsearch</a> for complete instructions on uploading, configuring, and deploying the model.</p>
<p>The quickest way to get the model is using the Eland Docker method.</p>
<p>The following command will load the model into Elasticsearch but will not start it. We will do that in the next step.</p>
<pre><code class="language-bash">docker run -it --rm --network host docker.elastic.co/eland/eland \
  eland_import_hub_model \
  --url https://mydeployment.es.us-west-1.aws.found.io:443/ \
  -u elastic -p password \
  --hub-model-id dslim/bert-base-NER --task-type ner

</code></pre>
<h5>Deploy and Start the NER Model</h5>
<p>In general, to improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.</p>
<p>We will deploy and start the NER model using the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.15/start-trained-model-deployment.html">Start trained model deployment API</a>.</p>
<p>We will configure the following:</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 Bytes Cache, as we expect a low cache hit rate</li>
<li>8192 Queue</li>
</ul>
<pre><code># Start the model with 4 Allocations x 1 Thread, no cache, and 8192 queue
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start?cache_size=0b&amp;number_of_allocations=4&amp;threads_per_allocation=1&amp;queue_capacity=8192

</code></pre>
<p>You should get a response that looks something like this.</p>
<pre><code class="language-bash">{
  &quot;assignment&quot;: {
    &quot;task_parameters&quot;: {
      &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;deployment_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;model_bytes&quot;: 430974836,
      &quot;threads_per_allocation&quot;: 1,
      &quot;number_of_allocations&quot;: 4,
      &quot;queue_capacity&quot;: 8192,
      &quot;cache_size&quot;: &quot;0&quot;,
      &quot;priority&quot;: &quot;normal&quot;,
      &quot;per_deployment_memory_bytes&quot;: 430914596,
      &quot;per_allocation_memory_bytes&quot;: 629366952
    },
...
    &quot;assignment_state&quot;: &quot;started&quot;,
    &quot;start_time&quot;: &quot;2024-09-23T21:39:18.476066615Z&quot;,
    &quot;max_assigned_allocations&quot;: 4
  }
}
</code></pre>
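<p>If you script deployments rather than typing them into Dev Tools, the same four settings can be assembled into the request path programmatically. The following is a minimal sketch using only the Python standard library. The helper name <code>start_deployment_path</code> is hypothetical, and the sketch stops at building the path, so authentication and the actual <code>POST</code> are up to you.</p>
<pre><code class="language-python">from urllib.parse import urlencode


def start_deployment_path(model_id, allocations=4, threads=1,
                          queue_capacity=8192, cache_size='0b'):
    '''Build the path and query string for the start trained model deployment API.'''
    params = {
        'cache_size': cache_size,
        'number_of_allocations': allocations,
        'threads_per_allocation': threads,
        'queue_capacity': queue_capacity,
    }
    return f'_ml/trained_models/{model_id}/deployment/_start?{urlencode(params)}'


if __name__ == '__main__':
    print(start_deployment_path('dslim__bert-base-ner'))
</code></pre>
<p>Keeping the allocation and thread counts as arguments makes it easy to try different values when you later tune ingest throughput.</p>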
<p>The NER model has been deployed and started and is ready to be used.</p>
<p>The following ingest pipeline implements the NER model via the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html">inference</a> processor.</p>
<p>There is a significant amount of code here, but only two items are of interest right now. The rest of the code is conditional logic to drive some additional specific behavior that we will look at more closely in the future.</p>
<ol>
<li>
<p>The inference processor calls the NER model we loaded previously by ID and passes it the text to analyze for PII. In this case, that text is the <code>message</code> field, which is mapped to the <code>text_field</code> input the model expects.</p>
</li>
<li>
<p>The script processor uses the entities returned by the NER model to replace the identified PII in the message with redaction placeholders. It looks more complex than it really is: it simply loops through the array of ML predictions, replaces each one in the message string with a constant, and stores the result in a new field <code>redact.message</code>. We will look at this a little closer in the following steps.</p>
</li>
</ol>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-2.json">The code can be found here</a> for the following three sections of code.</p>
<p>The NER PII Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;logs-ner-pii-processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># NER Pipeline
DELETE _ingest/pipeline/logs-ner-pii-processor
PUT _ingest/pipeline/logs-ner-pii-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to keep ml results for debugging&quot;,
        &quot;field&quot;: &quot;redact.ner.keep_result&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to PER, LOC, ORG to skip, or NONE to not drop any replacement&quot;,
        &quot;field&quot;: &quot;redact.ner.skip_entity&quot;,
        &quot;value&quot;: &quot;NONE&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Minimum class_probability (0 to 1) required to redact an entity, 0 redacts every detected entity&quot;,
        &quot;field&quot;: &quot;redact.ner.minimum_score&quot;,
        &quot;value&quot;: 0
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;inference&quot;: {
        &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
        &quot;field_map&quot;: {
          &quot;message&quot;: &quot;text_field&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_NER_FAILED&quot;
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.ner.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx.failure != 'REDACT_NER_FAILED'&quot;,
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;source&quot;: &quot;&quot;&quot;String msg = ctx['message'];
          for (item in ctx['ml']['inference']['entities']) {
          	if ((item['class_name'] != ctx.redact.ner.skip_entity) &amp;&amp; 
          	  (item['class_probability'] &gt;= ctx.redact.ner.minimum_score)) {  
          		  msg = msg.replace(item['entity'], '&lt;' + 
          		  'REDACTNER-'+ item['class_name'] + '_NER&gt;')
          	}
          }
          ctx.redact.message = msg&quot;&quot;&quot;,
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_REPLACEMENT_SCRIPT_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.ml?.inference?.entities.size() &gt; 0&quot;, 
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.ner?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx.redact.ner.keep_result != true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_missing&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
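<p>The replacement loop in the script processor can be easier to follow outside of Painless. Here is a minimal Python sketch of the same logic, assuming each prediction is a dictionary with the <code>entity</code>, <code>class_name</code>, and <code>class_probability</code> keys that the script reads. The function name <code>redact_entities</code>, the sample message, and the sample predictions are hypothetical.</p>
<pre><code class="language-python">def redact_entities(message, entities, skip_entity='NONE', minimum_score=0.0):
    '''Replace each detected entity in message with a REDACTNER placeholder.'''
    redacted = message
    for item in entities:
        if item['class_name'] == skip_entity:
            continue  # this entity class is configured to be left alone
        if minimum_score > item['class_probability']:
            continue  # prediction is below the confidence threshold
        placeholder = '&lt;REDACTNER-' + item['class_name'] + '_NER>'
        redacted = redacted.replace(item['entity'], placeholder)
    return redacted


if __name__ == '__main__':
    # Hypothetical message and model output, for illustration only
    message = 'Jane Doe flew to Paris'
    entities = [
        {'entity': 'Jane Doe', 'class_name': 'PER', 'class_probability': 0.99},
        {'entity': 'Paris', 'class_name': 'LOC', 'class_probability': 0.98},
    ]
    print(redact_entities(message, entities))
</code></pre>
<p>Note that the replacement is a plain string substitution, so every occurrence of a detected entity in the message is replaced, not just the one the model pointed at.</p>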
<p>The updated PII Processor Pipeline, which now calls the NER Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set sampling rate: 0 = none, 10000 = all; allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    }
  ]
}

</code></pre>
&lt;/details&gt;
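<p>The two <code>if</code> conditions in the pipeline above can be read as a small decision function. The following is only an illustrative Python sketch, not part of the pipeline: the helper names <code>run_sampler</code>, <code>run_ner</code>, and <code>sample_fraction</code> are hypothetical, and the assumption that <code>sample.sample_rate</code> is expressed out of 10000 is taken from the description string in the pipeline.</p>

```python
def run_sampler(sample: dict) -> bool:
    """Mirrors the condition: ctx.sample.enabled == true"""
    return sample.get("enabled") is True


def run_ner(sample: dict) -> bool:
    """Mirrors: enabled == false || (enabled == true && sampled == true)"""
    enabled = sample.get("enabled")
    return enabled is False or (enabled is True and sample.get("sampled") is True)


def sample_fraction(sample_rate: int) -> float:
    """Assumed scale (from the pipeline description): 0 = none, 10000 = all."""
    return sample_rate / 10000


# With sampling on, only documents flagged as sampled reach the NER pipeline.
print(run_ner({"enabled": True, "sampled": True}))
print(run_ner({"enabled": True, "sampled": False}))
# With sampling off, every document reaches the NER pipeline.
print(run_ner({"enabled": False}))
print(sample_fraction(1000))
```

<p>Under that assumption, the <code>sample.sample_rate</code> of 1000 used here corresponds to roughly 10% of the logs being sent through the NER model.</p>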
<p>Now reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>.</p>
<h3>Results</h3>
<p>Let's take a look at the results with the NER processing in place. In the Logs Explorer KQL query bar, run the following query:
<code>data_stream.dataset : pii and ml.inference.entities.class_name : (&quot;PER&quot; and &quot;LOC&quot; and &quot;ORG&quot; )</code></p>
<p>Logs Explorer should look something like this. Open the top message to see the details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-2-part-1.png" alt="PII Discover 2" /></p>
<h4>NER Model Results</h4>
<p>Let's take a closer look at what these fields mean.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_name</code><br />
<strong>Sample Value:</strong> <code>[PER, PER, LOC, ORG, ORG]</code><br />
<strong>Description:</strong> An array of the named entity classes that the NER model has identified.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_probability</code><br />
<strong>Sample Value:</strong> <code>[0.999, 0.972, 0.896, 0.506, 0.595]</code><br />
<strong>Description:</strong> The <code>class_probability</code> is a value between 0 and 1 that indicates how likely it is that a given data point belongs to a certain class. The higher the number, the higher the probability that the data point belongs to the named class. <strong>This is important because in the next blog we can decide on a threshold to use for alerting and redaction.</strong>
You can see that in this example the model identified a <code>LOC</code> (<code>ME</code>) as an <code>ORG</code>; we can find or filter out such cases by setting a threshold.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.entity</code><br />
<strong>Sample Value:</strong> <code>[Paul Buck, Steven Glens, South Amyborough, ME, Costco]</code><br />
<strong>Description:</strong> The array of entities identified that align positionally with the <code>class_name</code> and <code>class_probability</code>.</p>
<p><strong>Field:</strong> <code>ml.inference.predicted_value</code><br />
<strong>Sample Value:</strong> <code>[2024-09-23T14:32:14.608207-07:00Z] log.level=INFO: Payment successful for order #4594 (user: [Paul Buck](PER&amp;Paul+Buck), david59@burgess.net). Phone: 726-632-0527x520, Address: 3713 [Steven Glens](PER&amp;Steven+Glens), [South Amyborough](LOC&amp;South+Amyborough), [ME](ORG&amp;ME) 93580, Ordered from: [Costco](ORG&amp;Costco)</code><br />
<strong>Description:</strong> The predicted value of the model.</p>
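<p>Because <code>class_name</code>, <code>class_probability</code>, and <code>entity</code> align positionally, a threshold can be applied by walking the three arrays together. The following is a minimal Python sketch using the sample values above; the function name <code>entities_above</code> and the 0.6 cutoff are illustrative assumptions, not values recommended by this blog.</p>

```python
def entities_above(class_names, probabilities, entities, threshold):
    """Keep (entity, class, probability) triples at or above the threshold."""
    return [
        (entity, name, prob)
        for name, prob, entity in zip(class_names, probabilities, entities)
        if prob >= threshold
    ]


# Sample values from the field descriptions above.
class_names = ["PER", "PER", "LOC", "ORG", "ORG"]
probabilities = [0.999, 0.972, 0.896, 0.506, 0.595]
entities = ["Paul Buck", "Steven Glens", "South Amyborough", "ME", "Costco"]

# 0.6 is an arbitrary cutoff chosen only for illustration.
kept = entities_above(class_names, probabilities, entities, threshold=0.6)
for entity, name, prob in kept:
    print(f"{entity}: {name} ({prob})")
```

<p>With this cutoff the misclassified <code>ME</code> is dropped, but so is <code>Costco</code>, which was a correct <code>ORG</code>. Choosing a threshold is a trade-off between missed detections and false positives.</p>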
<h4>PII Assessment Dashboard</h4>
<p>Let's take a quick look at a dashboard built to assess the PII data.</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-1.ndjson</code> file that can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson</a></p>
<p>More complete instructions on Kibana Saved Objects can be found <a href="https://www.elastic.co/guide/en/kibana/current/managing-saved-objects.html">here</a>.</p>
<p>After loading the dashboard, navigate to it and select the right time range and you should see something like below. It shows metrics such as sample rate, percent of logs with NER, NER Score Trends etc. We will examine the assessment and actions in part 2 of this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-dashboard-1-part-1.png" alt="PII Dashboard 1" /></p>
<h2>Summary and Next Steps</h2>
<p>In this first part of the blog, we have accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In the upcoming <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10000 random logs in a file named <code>pii.log</code>, with a mix of logs that contain and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following:</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note:</strong> To reload the logs, you can simply re-run the above command. You can run the command multiple times during this exercise and the logs will be reloaded (actually loaded again). The new logs will not collide with previous runs, as each run has a unique <code>run.id</code>, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ner-regex-assess-redact-part-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 2]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-2</guid>
            <pubDate>Tue, 22 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect, assess, and redact PII in your logs using Elasticsearch, NLP and Pattern Matching]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs being ingested into Elasticsearch.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we covered the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we will cover the following:</p>
<ul>
<li>Apply the <code>redact</code> regex pattern processor and assess the results</li>
<li>Create Alerts using ES|QL</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Reminder of the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h3>Part 1 Prerequisites</h3>
<p>This blog picks up where <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a> left off. You must have the NER model, ingest pipelines, and dashboard from Part 1 installed and working.</p>
<ul>
<li>Loaded and configured NER Model</li>
<li>Installed all the composable ingest pipelines from Part 1 of the blog</li>
<li>Installed dashboard</li>
</ul>
<p>You can access the <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-blog-1-complete.json">complete solution for Blog 1 here</a>. Don't forget to load the dashboard, found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>.</p>
<h3>Applying the Redact Processor</h3>
<p>Next, we will apply the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html"><code>redact</code> processor</a>. The <code>redact</code> processor is a simple regex-based processor that takes a list of regex patterns, looks for them in a field, and replaces any matches with literals. The <code>redact</code> processor is reasonably performant and can run at scale. We will discuss this in detail at the end, in the <a href="#production-scaling">production scaling</a> section.</p>
<p>Elasticsearch comes packaged with a number of useful predefined <a href="https://github.com/elastic/elasticsearch/blob/8.15/libs/grok/src/main/resources/patterns/ecs-v1">patterns</a> that can be conveniently referenced by the <code>redact</code> processor. If one does not suit your needs, create a new pattern with a custom definition. The Redact processor replaces every occurrence of a match. If there are multiple matches, they will all be replaced with the pattern name.</p>
<p>In the code below, we leverage some of the predefined patterns and construct several custom patterns.</p>
<pre><code class="language-bash">        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,      &lt;&lt; Predefined
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,           &lt;&lt; Predefined
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;, &lt;&lt; Custom
          &quot;%{SSN:SSN_REGEX}&quot;,                 &lt;&lt; Custom
          &quot;%{PHONE:PHONE_REGEX}&quot;              &lt;&lt; Custom
        ]
</code></pre>
<p>We also replaced the PII with easily identifiable patterns we can use for assessment.</p>
<p>In addition, it is important to note that since the redact processor is a simple regex find and replace, it can be used against many &quot;secrets&quot; patterns, not just PII. There are many references for regex and secrets patterns, so you can reuse this capability to detect secrets in your logs.</p>
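<p>To get a feel for what the custom patterns match before loading them into a pipeline, they can be tried locally. The following is a rough Python sketch, not the <code>redact</code> processor itself: it uses Python's <code>re</code> module with the custom pattern definitions from this blog's redact pipeline, and it applies the patterns one after another, which is a simplifying assumption about how overlapping matches are handled. The sample text is made up.</p>

```python
import re

# Custom pattern definitions used by the redact pipeline in this blog.
PATTERNS = {
    "CREDIT_CARD_REGEX": r"\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
    "SSN_REGEX": r"\d{3}-\d{2}-\d{4}",
    "PHONE_REGEX": r"(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}",
}


def redact(text, prefix="<REDACTPROC-", suffix=">"):
    """Replace each match with prefix + pattern name + suffix, one pattern at a time."""
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f"{prefix}{name}{suffix}", text)
    return text


# Made-up sample text for illustration only.
sample = "SSN 123-45-6789, card 1234-5678-9012-3456, phone 726-632-0527"
redacted = redact(sample)
print(redacted)
```

<p>Note that the phone pattern begins with optional separators, including whitespace, so a match may also consume the space in front of the number.</p>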
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-1.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;redact processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Add the PII redact processor pipeline
DELETE _ingest/pipeline/logs-pii-redact-processor
PUT _ingest/pipeline/logs-pii-redact-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;redact&quot;: {
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;prefix&quot;: &quot;&lt;REDACTPROC-&quot;,
        &quot;suffix&quot;: &quot;&gt;&quot;,
        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;,
          &quot;%{SSN:SSN_REGEX}&quot;,
          &quot;%{PHONE:PHONE_REGEX}&quot;
        ],
        &quot;pattern_definitions&quot;: {
          &quot;CREDIT_CARD&quot;: &quot;&quot;&quot;\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}&quot;&quot;&quot;,
          &quot;SSN&quot;: &quot;&quot;&quot;\d{3}-\d{2}-\d{4}&quot;&quot;&quot;,
          &quot;PHONE&quot;: &quot;&quot;&quot;(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}&quot;&quot;&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'failure'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_PROCESSOR_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.proc.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message.contains('REDACTPROC')&quot;,
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.proc?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>And now, we will add the <code>logs-pii-redact-processor</code> pipeline to the overall <code>process-pii</code> pipeline:</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER and Redact Processor pipelines
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set sampling rate: 0 = none, 10000 = all; allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>. If you have not yet generated the logs, follow the instructions in the <a href="#data-loading-appendix">Data Loading Appendix</a>.</p>
<p>Go to Discover and enter the following into the KQL bar:
<code>sample.sampled : true and redact.message: REDACTPROC</code>. Then add the <code>redact.message</code> field to the table, and you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-1-part-2.png" alt="PII Discover Blog 2 Part 1" /></p>
<p>If you have not already loaded the dashboard from Blog Part 1, load it now using Kibana -&gt; Stack Management -&gt; Saved Objects -&gt; Import. It can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>.</p>
<p>It should look something like this now. Note that the REGEX portions of the dashboard are now active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-1-part-2.png" alt="PII Dashboards Blog 2 Part 1" /></p>
<h2>Checkpoint</h2>
<p>At this point, we have the following capabilities:</p>
<ul>
<li>Ability to sample incoming logs and apply this PII redaction</li>
<li>Detect and Assess PII with the NER/NLP and Pattern Matching</li>
<li>Assess the amount, type and quality of the PII detections</li>
</ul>
<p>This is a great point to stop if you are just running all this once to see how it works, but we have a few more steps to make this useful in production systems.</p>
<ul>
<li>Clean up the working and unredacted data</li>
<li>Update the Dashboard to work with the cleaned-up data</li>
<li>Apply Role Based Access Control to protect the raw unredacted data</li>
<li>Create Alerts</li>
<li>Production and Scaling Considerations</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Applying to Production Systems</h2>
<h3>Cleanup working data and update the dashboard</h3>
<p>And now we will add the cleanup code to the overall <code>process-pii</code> pipeline.</p>
<p>In short, we set a flag <code>redact.enable: true</code> that directs the pipeline to move the unredacted <code>message</code> field to <code>raw.message</code> and then move the redacted message field <code>redact.message</code> to the <code>message</code> field. We will &quot;protect&quot; the <code>raw.message</code> field in the following section.</p>
<p><strong>NOTE:</strong> Of course you can change this behavior if you want to completely delete the unredacted data. In this exercise we will keep it and protect it.</p>
<p>In addition we set <code>redact.cleanup: true</code> to clean up the NLP working data.</p>
<p>These fields allow a lot of control over what data you decide to keep and analyze.</p>
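<p>The field move described above can be sketched in a few lines of Python. This is only an illustration of the intended effect on a single document, modeled as a flat dictionary with dotted keys; the function name <code>apply_redaction</code> and the sample document are hypothetical.</p>

```python
def apply_redaction(doc: dict) -> dict:
    """Mirror the two rename steps on a flat dict with dotted keys."""
    redact_enabled = doc.get("redact.enable") is True
    pii_found = doc.get("redact.pii.found") is True
    if redact_enabled and pii_found and "redact.message" in doc:
        # Keep the original text under raw.message (protected later with RBAC).
        doc["raw.message"] = doc.pop("message")
        # Promote the redacted text to the field everyone reads.
        doc["message"] = doc.pop("redact.message")
    return doc


# Hypothetical document for illustration only.
doc = {
    "message": "SSN 123-45-6789",
    "redact.message": "SSN <REDACTPROC-SSN_REGEX>",
    "redact.pii.found": True,
    "redact.enable": True,
}
print(apply_redaction(doc))
```

<p>Documents where no PII was found, or where <code>redact.enable</code> is not true, are left unchanged in this sketch.</p>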
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-2.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER and Redact Processor pipelines and cleans up
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set sampling rate: 0 = none, 10000 = all; allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;raw.message&quot;
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;target_field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to clean up working data&quot;,
        &quot;field&quot;: &quot;redact.cleanup&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.cleanup == true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>.</p>
<p>Go to Discover and enter the following into the KQL bar:
<code>sample.sampled : true and redact.pii.found: true</code>. Then add the following fields to the table:</p>
<p><code>message</code>,<code>raw.message</code>,<code>redact.ner.found</code>,<code>redact.proc.found</code>,<code>redact.pii.found</code></p>
<p>You should see something like this:
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-2-part-2.png" alt="PII Discover Part 2 Blog 2" /></p>
<p>We have everything we need to move forward with protecting the PII and Alerting on it.</p>
<p>Next, load the new dashboard that works on the cleaned-up data.</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-2.ndjson</code> file that can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-dashboard-part-2.ndjson">here</a>.</p>
<p>Note: The new dashboard uses different fields under the covers since we have cleaned up the underlying data.</p>
<p>It should look something like this:
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-2-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Apply Role Based Access Control to protect the raw unredacted data</h3>
<p>Elasticsearch natively supports role-based access control, including field- and document-level access control, which dramatically reduces the operational and maintenance complexity required to secure our application.</p>
<p>We will create a Role that does not allow access to the <code>raw.message</code> field and then create a user and assign that user the role. With that role, the user will only be able to see the redacted message, which is now in the <code>message</code> field, but will not be able to access the protected <code>raw.message</code> field.</p>
<p><strong>NOTE:</strong> Since we only sampled 10% of the data in this exercise, the non-sampled <code>message</code> fields are not moved to <code>raw.message</code>, so they are still viewable. Even so, this shows the capability you can apply in a production system.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-rbac.json">The code can be found here</a> for the following section of code.</p>
&lt;details open&gt;
  &lt;summary&gt;RBAC protect-pii role and user code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Create role with no access to the raw.message field
GET _security/role/protect-pii
DELETE _security/role/protect-pii
PUT _security/role/protect-pii
{
 &quot;cluster&quot;: [],
 &quot;indices&quot;: [
   {
     &quot;names&quot;: [
       &quot;logs-*&quot;
     ],
     &quot;privileges&quot;: [
       &quot;read&quot;,
       &quot;view_index_metadata&quot;
     ],
     &quot;field_security&quot;: {
       &quot;grant&quot;: [
         &quot;*&quot;
       ],
       &quot;except&quot;: [
         &quot;raw.message&quot;
       ]
     },
     &quot;allow_restricted_indices&quot;: false
   }
 ],
 &quot;applications&quot;: [
   {
     &quot;application&quot;: &quot;kibana-.kibana&quot;,
     &quot;privileges&quot;: [
       &quot;all&quot;
     ],
     &quot;resources&quot;: [
       &quot;*&quot;
     ]
   }
 ],
 &quot;run_as&quot;: [],
 &quot;metadata&quot;: {},
 &quot;transient_metadata&quot;: {
   &quot;enabled&quot;: true
 }
}

# Create user stephen with protect-pii role
GET _security/user/stephen
DELETE /_security/user/stephen
POST /_security/user/stephen
{
 &quot;password&quot; : &quot;mypassword&quot;,
 &quot;roles&quot; : [ &quot;protect-pii&quot; ],
 &quot;full_name&quot; : &quot;Stephen Brown&quot;
}

</code></pre>
 &lt;/details&gt;
<p>Now log into a separate window as the new user <code>stephen</code>, who has the <code>protect-pii</code> role. Go to Discover, put <code>redact.pii.found : true</code> in the KQL bar, and add the <code>message</code> field to the table. Notice that the <code>raw.message</code> field is not available.</p>
<p>You should see something like this:
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-3-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Create an Alert when PII Detected</h3>
<p>Now, with the pipelines in place, creating an alert when PII is detected is easy. If needed, review <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Alerting in Kibana</a> in detail.</p>
<p>NOTE: <a href="#reloading-the-logs">Reload</a> the data if needed to have recent data.</p>
<p>First, we will create a simple <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL query</a> in Discover.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-esql-alert-blog-2.txt">The code can be found here.</a></p>
<pre><code>FROM logs-pii-default
| WHERE redact.pii.found == true
| STATS pii_count = count(*)
| WHERE pii_count &gt; 0
</code></pre>
<p>When you run this you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-esql-1-part-2.png" alt="PII ESQL Part 1 Blog 2" /></p>
<p>Now click the Alerts menu and select <code>Create search threshold rule</code>. We will create a rule that alerts us when PII is found.</p>
<p><strong>Select a time field: @timestamp
Set the time window: 5 minutes</strong></p>
<p>Assuming you loaded the data recently, when you run <strong>Test</strong> you should see something like:</p>
<p>pii_count : <code>343</code>
Alerts generated <code>query matched</code></p>
<p>Add an action when the alert is Active.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Query matched</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is active:

- PII Found: true
- PII Count: {{#context.hits}} {{_source.pii_count}}{{/context.hits}}
- Conditions Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>Add an Action for when the Alert is Recovered.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Recovered</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is Recovered:

- PII Found: false
- Conditions Not Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>When everything is set up, it should look like this. Then click <code>Save</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-1-part2.png" alt="Alert Setup" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-2-part2.png" alt="Action Alert" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-3-part2.png" alt="Action Alert" /></p>
<p>You should get an Active alert that looks like this if you have recent data. I sent mine to Slack.</p>
<pre><code>Elasticsearch query rule pii-found-esql is active:
- PII Found: true
- PII Count:  374
- Conditions Met: Query matched documents over 5m
- Timestamp: 2024-10-15T02:44:52.795Z
- Link: https://mydeployment123.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<p>And then if you wait you will get a Recovered alert that looks like this.</p>
<pre><code>Elasticsearch query rule pii-found-esql is Recovered:
- PII Found: false
- Conditions Not Met: Query did NOT match documents over 5m
- Timestamp: 2024-10-15T02:49:04.815Z
- Link: https://mydeployment123.kb.us-west-1.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<h3>Production Scaling</h3>
<h4>NER Scaling</h4>
<p>As we mentioned in <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#named-entity-recognition-ner-detection">Part 1 of this blog</a>, NER / NLP models are CPU-intensive and expensive to run at scale; thus, we employed a sampling technique to understand the risk in our logs without sending the full log volume through the NER model.</p>
<p>Please review <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#loading-configuration-and-execution-of-the-ner-pipeline">the setup and configuration of the NER</a> model from Part 1 of the blog.</p>
<p>We chose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a> for our PII case.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must not exceed the number of allocated processors (cores, not vCPUs) available per node.</p>
<p>The metrics below are related to the model and configuration from Part 1 of the blog.</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 Bytes Cache, as we expect a low cache hit rate.
<strong>Note:</strong> If there are many repeated logs, the cache can help, but with timestamps and other variations, the cache will not help and can even slow down the process</li>
<li>8192 Queue</li>
</ul>
<pre><code class="language-bash">GET _ml/trained_models/dslim__bert-base-ner/_stats
.....
           &quot;node&quot;: {
              &quot;0m4tq7tMRC2H5p5eeZoQig&quot;: {
.....
                &quot;attributes&quot;: {
                  &quot;xpack.installed&quot;: &quot;true&quot;,
                  &quot;region&quot;: &quot;us-west-1&quot;,
                  &quot;ml.allocated_processors&quot;: &quot;5&quot;, &lt;&lt; HERE 
.....
            },
            &quot;inference_count&quot;: 5040,
            &quot;average_inference_time_ms&quot;: 138.44285714285715, &lt;&lt; HERE 
            &quot;average_inference_time_ms_excluding_cache_hits&quot;: 138.44285714285715,
            &quot;inference_cache_hit_count&quot;: 0,
.....
            &quot;threads_per_allocation&quot;: 1,
            &quot;number_of_allocations&quot;: 4,  &lt;&lt;&lt; HERE
            &quot;peak_throughput_per_minute&quot;: 1550,
            &quot;throughput_last_minute&quot;: 1373,
            &quot;average_inference_time_ms_last_minute&quot;: 137.55280407865988,
            &quot;inference_cache_hit_count_last_minute&quot;: 0
          }
        ]
      }
    }
</code></pre>
<p>There are 3 key pieces of information above:</p>
<ul>
<li>
<p><code>&quot;ml.allocated_processors&quot;: &quot;5&quot;</code>
The number of physical cores / processors available</p>
</li>
<li>
<p><code>&quot;number_of_allocations&quot;: 4</code>
The number of allocations, with a maximum of 1 per physical core. <strong>Note</strong>: we could have used 5 allocations, but we only allocated 4 for this exercise</p>
</li>
<li>
<p><code>&quot;average_inference_time_ms&quot;: 138.44285714285715</code>
The average inference time per document.</p>
</li>
</ul>
<p>The math is pretty straightforward for throughput for Inferences per Min (IPM) per allocation (1 allocation per physical core), since an inference uses a single core and a single thread.</p>
<p>Then the Inferences per Min per Allocation is simply:</p>
<p><code>IPM per allocation = 60,000 ms (in a minute) / 138ms per inference = 435</code></p>
<p>This then gives the total Inferences per Minute, which roughly lines up with the <code>peak_throughput_per_minute</code> of 1550 reported above:</p>
<p><code>Total IPM = 435 IPM / allocation * 4 Allocations = ~1740</code></p>
<p>Suppose we want to do 10,000 IPM. How many allocations (cores) would we need?</p>
<p><code>Allocations = 10,000 IPM / 435 IPM per allocation = 23 allocations (cores, rounded up)</code></p>
<p>Or perhaps logs are coming in at 5000 EPS and you want to do 1% Sampling.</p>
<p><code>IPM = 5000 EPS * 60sec * 0.01 sampling = 3000 IPM sampled</code></p>
<p>Then</p>
<p><code>Number of Allocations = 3000 IPM / 435 IPM per allocation = 7 allocations (cores, rounded up)</code></p>
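<p>The sizing arithmetic above can be captured in a few lines of Python. This is a minimal sketch, not part of the original code: the function names are hypothetical, and it assumes, as stated above, 1 thread per allocation and 1 allocation per physical core, using the rounded 138 ms average inference time.</p>

```python
import math


def ipm_per_allocation(avg_inference_ms):
    """Inferences per minute one allocation (1 core, 1 thread) can sustain."""
    return 60_000 / avg_inference_ms


def sampled_ipm(events_per_second, sample_rate):
    """Inferences per minute produced by sampling an incoming event stream."""
    return events_per_second * 60 * sample_rate


def allocations_needed(target_ipm, avg_inference_ms):
    """Allocations (cores) required to reach target_ipm, rounded up."""
    return math.ceil(target_ipm * avg_inference_ms / 60_000)


avg_ms = 138  # average_inference_time_ms from the stats above, rounded

print(round(ipm_per_allocation(avg_ms)))                    # 435
print(allocations_needed(10_000, avg_ms))                   # 23
print(allocations_needed(sampled_ipm(5000, 0.01), avg_ms))  # 7
```

<p>Plugging in the average inference time of a different model, or a different sampling rate, gives the corresponding allocation estimate. Treat the result as a starting point and confirm it against the observed throughput in the model stats.</p>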
<p><strong>Want it faster?</strong> It turns out there is a more lightweight NER model, <a href="https://huggingface.co/dslim/distilbert-NER">distilbert-NER</a>, that is faster, but the tradeoff is a little less accuracy.</p>
<p>Running the logs through this model results in an inference time nearly twice as fast!</p>
<p><code>&quot;average_inference_time_ms&quot;: 66.0263959390863</code></p>
<p>Here is some quick math:
<code>IPM per allocation = 60,000 ms (in a minute) / 66ms per inference = 909</code></p>
<p>Suppose we want to do 25,000 IPM. How many allocations (cores) would we need?</p>
<p><code>Allocations = 25,000 IPM / 909 IPM per allocation = 28 allocations (cores, rounded up)</code></p>
<p><strong>Now you can apply this math to determine the correct sampling and NER scaling to support your logging use case.</strong></p>
<h4>Redact Processor Scaling</h4>
<p>In short, the <code>redact</code> processor should scale to production loads as long as you are using appropriately sized and configured nodes and have well-constructed regex patterns.</p>
<h3>Assessing incoming logs</h3>
<p>If you want to test on incoming log data in a data stream, all you need to do is change the conditional in the <code>logs@custom</code> pipeline to apply the <code>process-pii</code> pipeline to the dataset you want to assess. You can use any conditional that fits your use case.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and Redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<pre><code class="language-bash">    {
      &quot;pipeline&quot;: {
        &quot;description&quot; : &quot;Call the process_pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;, &lt;&lt;&lt; HERE
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
</code></pre>
<p>So if for example your logs are coming into <code>logs-mycustomapp-default</code> you would just change the conditional to</p>
<pre><code>        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'mycustomapp'&quot;,
</code></pre>
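<p>The dataset in that conditional is the middle part of the data stream name. As a minimal sketch, the Python below derives the conditional from a data stream name. The helper names are hypothetical, and it assumes the name follows the type-dataset-namespace naming scheme with no hyphens inside the dataset or namespace.</p>

```python
def split_data_stream_name(name):
    """Split a 'type-dataset-namespace' data stream name into its parts."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected type-dataset-namespace, got {name!r}")
    stream_type, dataset, namespace = parts
    return {"type": stream_type, "dataset": dataset, "namespace": namespace}


def dataset_condition(name):
    """Return the ingest pipeline conditional matching the data stream's dataset."""
    dataset = split_data_stream_name(name)["dataset"]
    return f"ctx?.data_stream?.dataset == '{dataset}'"


print(dataset_condition("logs-mycustomapp-default"))
# ctx?.data_stream?.dataset == 'mycustomapp'
```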
<h3>Assessing historical data</h3>
<p>If you have a historical (already ingested) data stream or index, you can run the assessment over it using the <code>_reindex</code> API.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and Redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<p>There are a couple of extra steps:
<a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-historical-data-blog-2.json">The code can be found here.</a></p>
<ol>
<li>First we can set the parameters to ONLY keep the sampled data as there is no reason to make a copy of all the unsampled data. In the <code>process-pii</code> pipeline, there is a setting <code>sample.keep_unsampled</code>, which we can set to <code>false</code>, which will then only keep the sampled data</li>
</ol>
<pre><code class="language-bash">    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: false &lt;&lt;&lt; SET TO false
      }
    },
</code></pre>
<ol start="2">
<li>Second, we will create a pipeline that will reroute the data to the correct data stream to run through all the PII assessment/detection pipelines. It also sets the correct <code>dataset</code> and <code>namespace</code></li>
</ol>
<pre><code class="language-bash">DELETE _ingest/pipeline/sendtopii
PUT _ingest/pipeline/sendtopii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.dataset&quot;,
        &quot;value&quot;: &quot;pii&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.namespace&quot;,
        &quot;value&quot;: &quot;default&quot;
      }
    },
    {
      &quot;reroute&quot; : 
      {
        &quot;dataset&quot; : &quot;{{data_stream.dataset}}&quot;,
        &quot;namespace&quot;: &quot;{{data_stream.namespace}}&quot;
      }
    }
  ]
}
</code></pre>
<ol start="3">
<li>Finally, we can run a <code>_reindex</code> to select the data we want to test/assess. It is recommended to review the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html">_reindex</a> documentation before trying this. First, select the source data stream you want to assess; in this example, it is the <code>logs-generic-default</code> logs data stream. Note: I also added a <code>range</code> filter to select a specific time range. There is a bit of a &quot;trick&quot; that we need to use since we are rerouting the data to the data stream <code>logs-pii-default</code>. To do this, we just set <code>&quot;index&quot;: &quot;logs-tmp-default&quot;</code> in the <code>_reindex</code>, as the correct data stream will be set in the pipeline. We must do that because <code>reroute</code> is a <code>noop</code> if it is called from/to the same data stream.</li>
</ol>
<pre><code class="language-bash">POST _reindex?wait_for_completion=false
{
  &quot;source&quot;: {
    &quot;index&quot;: &quot;logs-generic-default&quot;,
    &quot;query&quot;: {
      &quot;bool&quot;: {
        &quot;filter&quot;: [
          {
            &quot;range&quot;: {
              &quot;@timestamp&quot;: {
                &quot;gte&quot;: &quot;now-1h/h&quot;,
                &quot;lt&quot;: &quot;now&quot;
              }
            }
          }
        ]
      }
    }
  },
  &quot;dest&quot;: {
    &quot;op_type&quot;: &quot;create&quot;,
    &quot;index&quot;: &quot;logs-tmp-default&quot;,
    &quot;pipeline&quot;: &quot;sendtopii&quot;
  }
}
</code></pre>
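<p>If you need to repeat this for several sources or time ranges, the request body can be assembled programmatically. The following is a minimal Python sketch, not part of the original code: the function name <code>build_reindex_body</code> and its parameters are hypothetical, and it only builds the JSON body. Sending it to <code>POST _reindex</code> is left to whatever client you use.</p>

```python
import json


def build_reindex_body(source_index, gte="now-1h/h", lt="now",
                       pipeline="sendtopii",
                       placeholder_index="logs-tmp-default"):
    """Build a _reindex body mirroring the request shown above.

    The placeholder destination only has to differ from the final data
    stream, because the ingest pipeline reroutes each document.
    """
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "filter": [
                        {"range": {"@timestamp": {"gte": gte, "lt": lt}}}
                    ]
                }
            },
        },
        "dest": {
            "op_type": "create",
            "index": placeholder_index,
            "pipeline": pipeline,
        },
    }


# Example: assess the last 6 hours of logs-generic-default
body = build_reindex_body("logs-generic-default", gte="now-6h/h")
print(json.dumps(body, indent=2))
```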
<h2>Summary</h2>
<p>At this point, you have the tools and processes needed to assess, detect, analyze, alert on, and protect PII in your logs.</p>
<p><a href="https://github.com/bvader/elastic-pii/tree/main/elastic/blog-complete-end-solution">The end state solution can be found here</a>.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we covered the following:</p>
<ul>
<li>Redact PII using NER and redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p><em><strong>So get to work and reduce risk in your logs!</strong></em></p>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10000 random logs in a file named pii.log with a mix of logs that contain and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note</strong> To reload the logs, you can simply re-run the above command. You can run the command multiple times during this exercise and the logs will be reloaded (actually loaded again). The new logs will not collide with previous runs, as there will be a unique <code>run.id</code> for each run, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-ner-regex-assess-redact-part-2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Pruning incoming log volumes with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pruning-incoming-log-volumes</link>
            <guid isPermaLink="false">pruning-incoming-log-volumes</guid>
            <pubDate>Fri, 23 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[To drop or not to drop (events) is the question, not only in deciding what events and fields to remove from your logs but also in the various tools used. Learn about using Beats, Logstash, Elastic Agent, Ingest Pipelines, and OTel Collectors.]]></description>
            <content:encoded><![CDATA[<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/log/*.log
</code></pre>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_event:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              url.path: /profile
</code></pre>
<pre><code class="language-yaml">filebeat.inputs:
  - type: filestream
    id: my-logging-app
    paths:
      - /var/tmp/other.log
      - /var/log/*.log
processors:
  - drop_fields:
      when:
        and:
          - equals:
              url.scheme: http
          - equals:
              http.response.status_code: 200
      fields: [&quot;event.message&quot;]
      ignore_missing: false
</code></pre>
<pre><code class="language-ruby">input {
  file {
    id =&gt; &quot;my-logging-app&quot;
    path =&gt; [ &quot;/var/tmp/other.log&quot;, &quot;/var/log/*.log&quot; ]
  }
}
filter {
  if [url][scheme] == &quot;http&quot; and [url][path] == &quot;/profile&quot; {
    drop {
      percentage =&gt; 80
    }
  }
}
output {
  elasticsearch {
        hosts =&gt; &quot;https://my-elasticsearch:9200&quot;
        data_stream =&gt; &quot;true&quot;
    }
}
</code></pre>
<pre><code class="language-ruby"># Input configuration omitted
filter {
  if [url][scheme] == &quot;http&quot; and [http][response][status_code] == 200 {
    drop {
      percentage =&gt; 80
    }
    mutate {
      remove_field =&gt; [ &quot;[event][message]&quot; ]
    }
  }
}
# Output configuration omitted
</code></pre>
<pre><code class="language-bash">PUT _ingest/pipeline/my-logging-app-pipeline
{
  &quot;description&quot;: &quot;Event and field dropping for my-logging-app&quot;,
  &quot;processors&quot;: [
    {
      &quot;drop&quot;: {
        &quot;description&quot; : &quot;Drop event&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.url?.path == '/profile'&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;description&quot; : &quot;Drop field&quot;,
        &quot;field&quot; : &quot;event.message&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.http?.response?.status_code == 200&quot;,
        &quot;ignore_failure&quot;: false
      }
    }
  ]
}
</code></pre>
<pre><code class="language-bash">PUT _ingest/pipeline/my-logging-app-pipeline
{
  &quot;description&quot;: &quot;Event and field dropping for my-logging-app with failures&quot;,
  &quot;processors&quot;: [
    {
      &quot;drop&quot;: {
        &quot;description&quot; : &quot;Drop event&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.url?.path == '/profile'&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;description&quot; : &quot;Drop field&quot;,
        &quot;field&quot; : &quot;event.message&quot;,
        &quot;if&quot;: &quot;ctx?.url?.scheme == 'http' &amp;&amp; ctx?.http?.response?.status_code == 200&quot;,
        &quot;ignore_failure&quot;: false
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set 'ingest.failure.message'&quot;,
        &quot;field&quot;: &quot;ingest.failure.message&quot;,
        &quot;value&quot;: &quot;Ingestion issue&quot;
      }
    }
  ]
}
</code></pre>
<pre><code class="language-yaml">receivers:
  filelog:
    include: [/var/tmp/other.log, /var/log/*.log]
processors:
  filter/denylist:
    error_mode: ignore
    logs:
      log_record:
        - 'attributes[&quot;url.scheme&quot;] == &quot;info&quot;'
        - 'attributes[&quot;url.path&quot;] == &quot;/profile&quot;'
        - 'attributes[&quot;http.response.status_code&quot;] == 200'
  attributes/errors:
    actions:
      - key: error.message
        action: delete
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
  batch:
exporters:
  # Exporters configuration omitted
service:
  pipelines:
    # Pipelines configuration omitted
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pruning-incoming-log-volumes/blog-thumb-elastic-on-elastic.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Root cause analysis with logs: Elastic Observability's anomaly detection and log categorization]]></title>
            <link>https://www.elastic.co/observability-labs/blog/reduce-mttd-ml-machine-learning-observability</link>
            <guid isPermaLink="false">reduce-mttd-ml-machine-learning-observability</guid>
            <pubDate>Tue, 07 Feb 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability provides more than just log aggregation, metrics analysis, APM, and distributed tracing. Elastic’s machine learning capabilities help analyze the root cause of issues, allowing you to focus your time on the most important tasks.]]></description>
            <content:encoded><![CDATA[<p>With more and more applications moving to the cloud, an increasing amount of telemetry data (logs, metrics, traces) is being collected, which can help improve application performance, operational efficiencies, and business KPIs. However, analyzing this data is extremely tedious and time consuming given the tremendous amounts of data being generated. Traditional methods of alerting and simple pattern matching (visual or simple searching etc) are not sufficient for IT Operations teams and SREs. It’s like trying to find a needle in a haystack.</p>
<p>In this blog post, we’ll cover some of Elastic’s artificial intelligence for IT operations (AIOps) and machine learning (ML) capabilities for root cause analysis.</p>
<p>Elastic’s machine learning will help you investigate performance issues by providing anomaly detection and pinpointing potential root causes through time series analysis and log outlier detection. These capabilities will help you reduce time in finding that “needle” in the haystack.</p>
<p>Elastic’s platform enables you to get started on machine learning quickly. You don’t need to have a data science team or design a system architecture. Additionally, there’s no need to move data to a third-party framework for model training.</p>
<p>Preconfigured machine learning models for observability and security are available. If those don't work well enough on your data, in-tool wizards guide you through the few steps needed to configure custom anomaly detection and train your model with supervised learning. To help get you started, there are several key features built into Elastic Observability to aid in analysis, helping bypass the need to run specific ML models. These features help minimize the time and analysis for logs.</p>
<p>Let’s review some of these built-in ML features:</p>
<p><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</p>
<p><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped, based on their messages and formats, so that you can take action quicker.</p>
<p><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes. An overview of this capability is published here: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">APM correlations in Elastic Observability: Automatically identifying probable causes of slow or failed transactions</a>.</p>
<p><strong>AIOps Labs:</strong> AIOps Labs provides two main capabilities using advanced statistical methods:</p>
<ul>
<li><strong>Log spike detector</strong> helps identify reasons for increases in log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view. Examine the histogram chart of the log rates for a given data view, and find the reason behind a particular change possibly in millions of log events across multiple fields and values.</li>
<li><strong>Log pattern analysis</strong> helps you find patterns in unstructured log messages and makes it easier to examine your data. It performs categorization analysis on a selected field of a data view, creates categories based on the data, and displays them together with a chart that shows the distribution of each category and an example document that matches the category.</li>
</ul>
<p><em><strong>In this blog, we will cover anomaly detection and log categorization against the popular “Hipster Shop app” developed by Google, and modified recently by OpenTelemetry.</strong></em></p>
<p>Overviews of high-latency capabilities can be found <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">here</a>, and an overview of AIOps labs can be found <a href="https://www.youtube.com/watch?v=jgHxzUNzfhM&amp;list=PLhLSfisesZItlRZKgd-DtYukNfpThDAv_&amp;index=5">here</a>.</p>
<p>In this blog, we will examine a scenario where we use anomaly detection and log categorization to help identify a root cause of an issue in Hipster Shop.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Utilize a version of the ever so popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Hipster Shop</a> demo application. It was originally written by Google to showcase Kubernetes, and a multitude of variants are available, such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. The Elastic version is found <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Ensure you have configured the app for either Elastic APM agents or OpenTelemetry agents. For more details, please refer to these two blogs: <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OTel in Elastic</a> and <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Observability and security with OTel in Elastic</a>. Additionally, review the <a href="https://www.elastic.co/guide/en/apm/guide/current/open-telemetry.html">OTel documentation in Elastic</a>.</li>
<li>Look through an overview of <a href="https://www.elastic.co/guide/en/observability/current/apm.html">Elastic Observability APM capabilities</a>.</li>
<li>Look through our <a href="https://www.elastic.co/guide/en/observability/8.5/inspect-log-anomalies.html">Anomaly detection documentation</a> for logs and <a href="https://www.elastic.co/guide/en/observability/8.5/categorize-logs.html">log categorization documentation</a>.</li>
</ul>
<p>Once you’ve instrumented your application with APM (Elastic or OTel) agents and are ingesting metrics and logs into Elastic Observability, you should see a service map for the application as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-service-map.png" alt="" /></p>
<p>In our example, we’ve introduced issues to help walk you through the root cause analysis features: anomaly detection and log categorization. You might have a different set of anomalies and log categorization depending on how you load the application and/or introduce specific issues.</p>
<p>As part of the walk-through, we’ll assume we are a DevOps or SRE managing this application in production.</p>
<h2>Root cause analysis</h2>
<p>While the application has been running normally for some time, you get a notification that some of the services are unhealthy. This can occur from the notification setting you’ve set up in Elastic or other external notification platforms (including customer related issues). In this instance, we’re assuming that customer support has called in multiple customer complaints about the website.</p>
<p>How do you as a DevOps or SRE investigate this? We will walk through two avenues in Elastic to investigate the issue:</p>
<ul>
<li>Anomaly detection</li>
<li>Log categorization</li>
</ul>
<p>While we show these two paths separately, they can be used in conjunction and are complementary, as they are both tools Elastic Observability provides to help you troubleshoot and identify a root cause.</p>
<h3>Machine learning for anomaly detection</h3>
<p>Elastic will detect anomalies based on historical patterns and identify a probability of these issues.</p>
<p>Starting with the service map, you can see anomalies identified with red circles and as we select them, Elastic will provide a score for the anomaly.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-service-map-anomaly-detection.png" alt="" /></p>
<p>In this example, we can see that there is a score of 96 for a specific anomaly for the productCatalogService in the Hipster Shop application. An anomaly score indicates the significance of the anomaly compared to previously seen anomalies. More information on anomaly detection results can be found here. We can also dive deeper into the anomaly and analyze the details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-single-metric-viewer.png" alt="" /></p>
<p>What you will see for the productCatalogService is that there is a severe spike in average transaction latency time, which is the anomaly that was detected in the service map. Elastic’s machine learning has identified a specific metric anomaly (shown in the single metric view). It’s likely that customers are potentially responding to the slowness of the site and that the company is losing potential transactions.</p>
<p>One step to take next is to review all the other potential anomalies that we saw in the service map in a larger picture. Use an anomaly explorer to view all the anomalies that have been identified.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer.png" alt="" /></p>
<p>Elastic is identifying numerous services with anomalies. productCatalogService has the highest score, and a good number of others (frontend, checkoutService, advertService, and more) also have high scores. However, this analysis is looking at just one metric.</p>
<p>Elastic can help detect anomalies across all types of data, such as Kubernetes data, metrics, and traces. If we analyze across all these types (via individual jobs we’ve created in Elastic machine learning), we will see a more comprehensive view as to what is potentially causing this latency issue.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer-job-selection.png" alt="" /></p>
<p>Once all the potential jobs are selected and we’ve sorted by service.name, we can see that productCatalogService is still showing a high anomaly influencer score.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-anomaly-explorer-timeline.png" alt="" /></p>
<p>In addition to the chart giving us a visual of the anomalies, we can review all the potential anomalies. As you will notice, Elastic has also categorized these anomalies (see category examples column). As we scroll through the results, we notice a potential postgreSQL issue from the categorization, which also has a high 94 score. Machine learning has identified a “rare mlcategory,” meaning that it has rarely occurred, hence pointing to a potential cause of the issue customers are seeing.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-machine-learning-service-name.png" alt="" /></p>
<p>We also notice that this issue is potentially caused by pgbench, a popular postgreSQL tool to help benchmark the database. pgbench runs the same sequence of SQL commands over and over, possibly in multiple, concurrent database sessions. While pgbench is definitely a useful tool, it should not be used in a production environment as it causes heavy load on the database host, likely causing the higher latency issues on the site.</p>
<p>While this may or may not be the ultimate root cause, we have rather quickly identified a potential issue that has a high probability of being the root cause. An engineer likely intended to run pgbench against a staging database to evaluate its performance, and not the production environment.</p>
<h3>Machine learning for log categorization</h3>
<p>Elastic Observability’s service map has detected an anomaly, and in this part of the walk-through, we take a different approach by investigating the service details from the service map versus initially exploring the anomaly. When we explore the service details for productCatalogService, we see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-product-catalog-service.png" alt="" /></p>
<p>The service details are identifying several things:</p>
<ol>
<li>There is abnormally high latency compared to the expected bounds of the service. We see that latency recently rose well above normal (upward of 1s), compared with an average of 275ms.</li>
<li>There is also a high failure rate for the same time frame as the high latency (lower left chart, “<strong>Failed transaction rate</strong>”).</li>
<li>Additionally, we can see the transactions, and one in particular, /ListProduct, has abnormally high latency in addition to a high failure rate.</li>
<li>We see productCatalogService has a dependency on PostgreSQL.</li>
<li>We also see errors all related to PostgreSQL.</li>
</ol>
<p>We have an option to dig through the logs and analyze in Elastic or we can use a capability to identify the logs more easily.</p>
<p>If we go to Categories under Logs in Elastic Observability and search for postgresql.log to help identify PostgreSQL logs that could be causing this error, we see that Elastic’s machine learning has automatically categorized the PostgreSQL logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-categories.png" alt="" /></p>
<p>We notice two additional items:</p>
<ul>
<li>There is a high-count category (message count of 23,797 with a high anomaly score of 70) related to pgbench (which is odd to see in production). Hence we search further for all pgbench-related logs in Categories.</li>
<li>We see an odd issue regarding terminating the connection (with a low count).</li>
</ul>
<p>While investigating the second error, which is severe, we can see logs from Categories before and after the error.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/blog-elastic-timestamp.png" alt="" /></p>
<p>This troubleshooting shows PostgreSQL having a FATAL error, the database shutting down prior to the error, and all connections terminating. Given the two immediate issues we identified, we have an idea that someone was running pgbench and this potentially overloaded the database, causing the latency issue that customers are seeing.</p>
<p>The next steps here could be to investigate anomaly detection and/or work with the developers to review the code and identify pgbench as part of the deployed configuration.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you further identify and get closer to pinpointing the root cause of issues without having to look for a “needle in a haystack.” Here’s a quick recap of what you learned:</p>
<ul>
<li>
<p>Elastic Observability has numerous capabilities to help you reduce your time to find root cause and improve your MTTR (even MTTD). In particular, we reviewed the following two main capabilities in this blog:</p>
<ol>
<li><strong>Anomaly detection:</strong> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your time series data — learning trends, periodicity, and more — in real time to identify anomalies, streamline root cause analysis, and reduce false positives. Anomaly detection runs in and scales with Elasticsearch and includes an intuitive UI.</li>
<li><strong>Log categorization:</strong> Using anomaly detection, Elastic also identifies patterns in your log events quickly. Instead of manually identifying similar logs, the logs categorization view lists log events that have been grouped based on their messages and formats so that you can take action quicker.</li>
</ol>
</li>
<li>
<p>You learned how easy and simple it is to use Elastic Observability’s log categorization and anomaly detection capabilities without having to understand machine learning (which helps drive these features) or do any lengthy setup.</p>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above.</p>
</li>
</ul>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/reduce-mttd-ml-machine-learning-observability/illustration-machine-learning-anomaly-1680x980.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Live logs and prosper: fixing a fundamental flaw in observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams</link>
            <guid isPermaLink="false">reimagine-observability-elastic-streams</guid>
            <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Stop chasing symptoms. Learn how Streams in Elastic Observability fixes the fundamental flaw in observability, using AI to proactively find the 'why' in your logs for faster resolution.]]></description>
            <content:encoded><![CDATA[<p>SREs are often overwhelmed by dashboards and alerts that show what and where things are broken, but fail to reveal why. This industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial &quot;why&quot; is buried in information-rich logs, but their massive volume and unstructured nature have led the industry to throw them aside or treat them like second-class citizens. As a result, SREs are forced to turn every investigation into a high-stress, time-consuming hunt for clues. We can solve this problem with logs, but unlocking their potential requires us to reimagine how we work with them and improve the overall investigation journey.</p>
<h2>Observability, the broken promise</h2>
<p>To see why the current model fails, let’s look at the all-too-familiar challenge every SRE dreads: knowing a problem exists but needing to spend valuable time just trying to find where to even start the investigation.</p>
<p>Imagine you get a Slack message from the support team: &quot;a few high-value customers are reporting their payments are failing.&quot; You have no shortage of alerts, but most are just flagging symptoms. You don’t know where to start. You decide to check the logs to see if there is anything obvious, starting with the systems that have the high CPU alert.</p>
<p>You spend a few minutes searching and <code>grep</code>-ing through terabytes of logs for affected customer IDs, trying to piece together the problem. Nothing. You worry that you aren’t getting all the logs to reveal the problem, so you turn on more logging in the application. Now you’re knee-deep in data, desperately trying to find patterns, errors, or other &quot;hints&quot; that will give you a clue as to the <em>why</em>.</p>
<p>Finally, one of the broader log queries hits on an error code associated with an impacted customer ID. This is the first real clue. You pivot your search to this new error code and after an hour of digging, you finally uncover the error message. You've finally found the <em>why</em>, but it was a stressful, manual hunt that took far too much time and impacted dozens more customers.</p>
<p>This incident perfectly illustrates the broken promise of modern observability: The complete failure of the investigation process. Investigations are a manual, reactive process that SREs are forced into every day. At Elastic, we believe metrics, traces, and logs are all essential, but their roles, and the workflow between them, must be fundamentally re-imagined for effective investigations.</p>
<p>Observability is about having the clearest understanding possible of the <em>what</em>, <em>where</em>, and <em>why</em>. Metrics are essential for understanding the <em>what</em>. They are the heartbeat of your system, powering the dashboards and alerts that tell you when a threshold has been breached, like high CPU utilization or error rates. But they are aggregates; they show the symptom, rarely the root cause. Traces are good at identifying the <em>where</em>. They map the journey of a request through a distributed system, pinpointing the specific microservice or function where latency spikes or an error originates. Yet, their effectiveness hinges on complete and consistent code instrumentation, a constant dependency on development teams that can leave you with critical visibility gaps. Logs tell you the <em>why</em>. They contain all the rich, contextual, and unfiltered truth of an event. If we can more proactively and efficiently extract information from logs, we can greatly improve our overall understanding of our environments.</p>
<h2>Challenges of logs in modern environments</h2>
<p>While logs are in the standard toolbox, they have been neglected. SREs using today’s solutions deal with several major problems:</p>
<ul>
<li>First, due to their unstructured nature, it’s very difficult to parse and manage logs so that they’re useful. As a result, many SRE teams spend a lot of time building and maintaining complex pipelines to help manage this process. </li>
</ul>
<ul>
<li>Second, logs can get expensive at high volume, which leads teams to drop them on the floor to control costs, throwing away valuable information in the process. Consequently, when an incident occurs, you waste precious time hunting for the right logs, and manually correlating across services.</li>
</ul>
<ul>
<li>Finally, nobody has built a log solution that proactively works to find the important signals in logs and to surface those critical <em>whys</em> to you when you need them. As a result, log-based investigations are too painful and slow.</li>
</ul>
<p>Why are we here? As applications became more complex, log volume became unmanageable. Instead of solving this with automation, the industry took a shortcut: it gave up on getting the most out of logs and prioritized more manageable but less informative signals.</p>
<p>This decision is the origin of the broken, reactive model. It forced observability into a manual loop of 'observing' alerts, rather than building automation that could help us truly understand our systems to improve how we root cause and resolve issues. This has transformed SREs from investigators into full-time data wranglers, wrestling with Grok patterns and fragile ETL scripts instead of solving outages. </p>
<h2>Introducing Streams to rethink how you use logs for investigations</h2>
<p>Streams is an agentic AI solution that simplifies working with logs to help SRE teams rapidly understand the <em>why</em> behind an issue for faster resolution. The combination of Elasticsearch and AI is turning manual management of noisy logs into automated workflows that identify patterns, context, and meaning, marking a fundamental shift in observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto-01.png" alt="Streams" /></p>
<h4>Log everything in any format</h4>
<p>By applying the Elasticsearch platform for context engineering to bring together retrieval and AI-driven parsing to keep up with schema changes, we are reimagining the entire log pipeline.  </p>
<p>Streams ingests raw logs from all your sources to a single destination. It then uses AI to partition incoming logs into their logical components and parses them to extract relevant fields for an SRE to validate, approve, or modify. Imagine a world where you simply point your logs to a single endpoint, and everything just works. That means less wrestling with Grok patterns, configuring processors, and hunting for the right plugin, all of which significantly reduces complexity. Streams is a big step towards realizing that vision.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto-02.png" alt="Streams" /></p>
<p>As a result, SREs are freed from managing complex ingestion pipelines, allowing them to spend less time on data wrangling and more time preventing service disruptions.</p>
<h4>Solve incidents faster with Significant Events </h4>
<p>Significant Events, a capability within Streams, uses AI to automatically surface major errors and anomalies, enabling you to be proactive in your investigations. So, instead of just combing through endless noise, you can focus on the events that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other significant signals of change. These events act as actionable markers, giving SREs early warning and clear focus to begin an investigation before service impact.</p>
<p>With this new foundation, logs will become your primary signal for investigation. The panicked, manual search for a needle in a digital haystack is about to be over. Significant Events acts like a smart metal detector that sifts through the chaos and only beeps when it finds issues, helping you to easily ignore all that hay and find the &quot;needle&quot; faster. </p>
<p>Now imagine the same scenario we started with. Instead of starting a frantic, time-consuming grep through terabytes of logs, you find that Streams has already done the heavy lifting. Its AI-driven analysis has detected a new, anomalous pattern that began before your support team even knew about it and automatically surfaced it as a significant event. Rather than you hunting for a clue, the clue finds you. </p>
<p>With a single click, you have the <em>why</em>: a Java out-of-memory error in a specific service component. This is your starting point. You find the root cause in under two minutes and begin remediation. The customer impact is stopped, the dev team gets the specific error, and the problem is contained before it can escalate. In this case, metrics and traces were unhelpful in finding the <em>why</em>. The answer was waiting in the logs all along.</p>
<p>This ideal outcome is possible because you can both afford to keep every log and instantly find the signal within them. Elastic's cost-efficient architecture with powerful compression, searchable snapshots, and data tiering makes full retention a reality. From there, Streams automatically surfaces the significant event, ensuring that the answer is never lost in the noise.</p>
<p>Elastic is the only company that provides an AI-driven log-first approach to elevate your observability signals and make it dramatically faster and easier to get to <em>why</em>. This is built on our decades of leadership in search, relevance, and powerful analytics that provides the foundation for understanding logs at a deep, semantic level.</p>
<h2>The vision for Streams </h2>
<p>The partitioning, parsing, and Significant Events you see today are just the starting point. The next step in our vision is to use the Significant Events to automatically generate critical SRE artifacts. Imagine Streams creating intelligent alerts, on-the-fly investigation dashboards, and even data-driven SLOs based <em>only</em> on the events that actually impact service health. From there, the goal is to use AI to drive automated Root Cause Analysis (RCA) directly from log patterns and generate remediation runbooks, turning a multi-hour hunt into an instant resolution recommendation.</p>
<p>Once this AI-driven log foundation is in place, our vision for Streams expands to become a unified intelligence layer that operates across all your telemetry data. It’s not just about making each signal better in isolation, but about understanding the context and relationships between them to solve complex problems. </p>
<p>For metrics, Streams won’t just alert you to a single metric spike but will detect a correlated anomaly across multiple, seemingly unrelated metrics (e.g., p99 latency for a specific service, a rise in garbage collection time, and transaction success rate).</p>
<p>Similarly, for traces, it identifies when a new, unexpected service call (e.g., a new database or an external API) appears in a critical transaction path after a deployment, or when a specific span is suddenly responsible for a majority of errors across all traces, even if the overall error rate hasn't breached a threshold.</p>
<p>The goal is not to have separate streams for logs, metrics, and traces, but to weave them into a single narrative that automatically correlates all three signals. Ultimately, Streams is about fundamentally changing the goal from a human-led data-gathering exercise to proactive, AI-driven resolution.</p>
<p><em>For more on Streams:</em></p>
<p><em>Read the</em> <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations"><em>Streams launch blog</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/reimagine-observability-elastic-streams/streams-manifesto.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Build better Service Level Objectives (SLOs) from logs and metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics</link>
            <guid isPermaLink="false">service-level-objectives-slos-logs-metrics</guid>
            <pubDate>Fri, 23 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in 8.12. This blog reviews this feature and how you can use it with Elastic's AI Assistant to meet SLOs.]]></description>
            <content:encoded><![CDATA[<p>In today's digital landscape, applications are at the heart of both our personal and professional lives. We've grown accustomed to these applications being perpetually available and responsive. This expectation places a significant burden on the shoulders of developers and operations teams.</p>
<p>Site reliability engineers (SREs) face the challenging task of sifting through vast quantities of data, not just from the applications themselves but also from the underlying infrastructure. In addition to data analysis, they are responsible for ensuring the effective use and development of operational tools. The growing volume of data, the daily resolution of issues, and the continuous evolution of tools and processes can detract from the focus on business performance.</p>
<p>Elastic Observability offers a solution to this challenge. It enables SREs to integrate and examine all telemetry data (logs, metrics, traces, and profiling) in conjunction with business metrics. This comprehensive approach to data analysis fosters operational excellence, boosts productivity, and yields critical insights, all of which are integral to maintaining high-performing applications in a demanding digital environment.</p>
<p>To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in <a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">8.12</a>. This feature enables setting measurable performance targets for services, such as <a href="https://sre.google/sre-book/monitoring-distributed-systems/">availability, latency, traffic, errors, and saturation or define your own</a>. Key components include:</p>
<ul>
<li>
<p>Defining and monitoring SLIs (Service Level Indicators)</p>
</li>
<li>
<p>Monitoring error budgets indicating permissible performance shortfalls</p>
</li>
<li>
<p>Alerting on burn rates showing error budget consumption</p>
</li>
</ul>
<p>Users can monitor SLOs in real-time with dashboards, track historical performance, and receive alerts for potential issues. Additionally, SLO dashboard panels offer customized visualizations.</p>
<p>Service Level Objectives (SLOs) are generally available for our Platinum and Enterprise subscription customers.</p>
<p>In this blog, we will outline the following:</p>
<ul>
<li>
<p>What are SLOs? A Google SRE perspective</p>
</li>
<li>
<p>Several scenarios of defining and managing SLOs</p>
</li>
</ul>
<h2>Service Level Objective overview</h2>
<p>Service Level Objectives (SLOs) are a crucial component for Site Reliability Engineering (SRE), as detailed in <a href="https://sre.google/sre-book/table-of-contents/">Google's SRE Handbook</a>. They provide a framework for quantifying and managing the reliability of a service. The key elements of SLOs include:</p>
<ul>
<li>
<p><strong>Service Level Indicators (SLIs):</strong> These are carefully selected metrics, such as uptime, latency, throughput, or error rates, that represent the aspects of the service that matter from an operations or business perspective. Hence, an SLI is a measure of the service level provided (latency, uptime, etc.), and it is defined as a ratio of good over total events, with a range between 0% and 100%.</p>
</li>
<li>
<p><strong>Service Level Objective (SLO):</strong> An SLO is the target value for a service level measured as a percentage by an SLI. Above the threshold, the service is compliant. As an example, if we use service availability as an SLI with a target of 99.9% successful responses, then any time the share of failed responses is &gt; 0.1%, the SLO will be out of compliance.</p>
</li>
<li>
<p><strong>Error budget:</strong> This represents the threshold of acceptable errors, balancing the need for reliability with practical limits. It is defined as 100% minus the SLO, that is, the quantity of errors that is tolerated.</p>
</li>
<li>
<p><strong>Burn rate:</strong> This concept relates to how quickly the service is consuming its error budget, which is the acceptable threshold for unreliability agreed upon by the service providers and its users.</p>
</li>
</ul>
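<p>The four definitions above fit together arithmetically. The sketch below is one common way to express them, not Elastic's internal implementation: the function names, the sample counts, and the choice to treat a window with no traffic as compliant are all assumptions made for illustration.</p>

```python
def sli(good_events, total_events):
    """SLI as the ratio of good events over total events (0.0 to 1.0)."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as compliant
    return good_events / total_events


def error_budget(slo_target):
    """Error budget is 100% minus the SLO target, expressed as a fraction."""
    return 1.0 - slo_target


def burn_rate(good_events, total_events, slo_target):
    """Observed error rate divided by the error budget.

    A value of 1.0 means errors arrive exactly as fast as the budget allows;
    anything higher exhausts the budget before the window ends.
    """
    observed_error_rate = 1.0 - sli(good_events, total_events)
    return observed_error_rate / error_budget(slo_target)


# Hypothetical numbers: a 99.9% target and 200 failures in 100,000 requests.
target = 0.999
good, total = 99_800, 100_000

print(f"SLI: {sli(good, total):.3%}")
print(f"Compliant: {sli(good, total) >= target}")
print(f"Burn rate: {burn_rate(good, total, target):.1f}x")
```

<p>With these sample numbers the service sits below its target and is consuming its error budget at roughly twice the sustainable pace.</p>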
<p>Understanding these concepts and effectively implementing them is essential for maintaining a balance between innovation and reliability in service delivery. For more detailed information, you can refer to <a href="https://sre.google/workbook/slo-document/">Google's SRE Handbook</a>.</p>
<p>One main thing to remember is that SLO monitoring is <em>not</em> incident monitoring. SLO monitoring is a proactive, strategic approach designed to ensure that services meet established performance standards and user expectations. It involves tracking Service Level Objectives, error budgets, and the overall reliability of a service over time. This predictive method helps in preventing issues that could impact users and aligns service performance with business objectives.</p>
<p>In contrast, incident monitoring is a reactive process focused on detecting, responding to, and mitigating service incidents as they occur. It aims to address unexpected disruptions or failures in real time, minimizing downtime and impact on service. This includes monitoring system health, errors, and response times during incidents, with a focus on rapid response to minimize disruption and preserve the service's reputation.</p>
<p>Elastic®’s SLO capability is based directly off the Google SRE Handbook. All the definitions and semantics are utilized as described in Google’s SRE handbook. Hence users can perform the following on SLOs in Elastic:</p>
<ul>
<li>
<p>Define an SLO on an SLI such as KQL (log based query), service availability, service latency, custom metric, histogram metric, or a timeslice metric. Additionally, set the appropriate threshold.</p>
</li>
<li>
<p>Utilize occurrence versus time slice based budgeting. Occurrences uses the number of good events over the number of total events to compute the SLO. Timeslices break the overall time window into smaller slices of a defined duration and compute the number of good slices over the total slices to compute the SLO. Timeslice targets are more accurate and useful when calculating things like a service’s SLO when trying to meet agreed upon customer targets.</p>
</li>
<li>
<p>Manage all the SLOs in a singular location.</p>
</li>
<li>
<p>Trigger alerts from the defined SLO, whether the SLI is off, burn rate is used up, or the error rate is X.</p>
</li>
<li>
<p>Create unique service level dashboards with SLO information for a more comprehensive view of the service.</p>
</li>
</ul>
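<p>The difference between occurrence and timeslice budgeting is easiest to see on a small data set. The following sketch computes both from the same window; the function names, the sample slices, the 95% per-slice target, and the decision to count empty slices as not good are assumptions for illustration only.</p>

```python
def occurrences_sli(slices):
    """Good events over total events across the whole window."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total


def timeslices_sli(slices, slice_target):
    """Good slices over total slices.

    A slice is good when its own good/total ratio meets slice_target.
    Empty slices count as not good in this sketch (an assumption).
    """
    good_slices = sum(1 for g, t in slices if t > 0 and g / t >= slice_target)
    return good_slices / len(slices)


# Hypothetical window of four slices, each as (good_events, total_events).
# The third slice is a short outage during a quiet period.
window = [(1000, 1000), (990, 1000), (10, 20), (2000, 2000)]

print(f"Occurrences SLI: {occurrences_sli(window):.2%}")
print(f"Timeslices SLI:  {timeslices_sli(window, 0.95):.2%}")
```

<p>Because the outage happened while traffic was low, it barely moves the occurrence-based value, yet it costs a full slice under timeslice budgeting. That is why the two methods can disagree about the same incident.</p>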
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/1-slo-blog.png" alt="Create alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/2-slo-blog.png" alt="Create dashboards" /></p>
<p>SREs need to be able to manage business metrics.</p>
<h2>SLOs based on logs: NGINX availability</h2>
<p>Defining SLOs does not always mean metrics need to be used. Logs are a rich form of information, even when they have metrics embedded in them. Hence it’s useful to understand your business and operations status based on logs.</p>
<p>Elastic allows you to create an SLO based on specific fields in the log message, which don’t have to be metrics. A simple example is a multi-tier app that has a web server layer (nginx), a processing layer, and a database layer.</p>
<p>Let’s say that your processing layer is managing a significant number of requests. You want to ensure that the service is up properly. The best way is to ensure that all http.response.status_code are less than 500. Anything less ensures the service is up and any errors (like 404) are all user or client errors versus server errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/3-slo-blog.png" alt="expanded document" /></p>
<p>If we use Discover in Elastic, we see that there are close to 2M log messages over a seven-day time frame.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/4-slo-blog.png" alt="17k" /></p>
<p>Additionally, the number of messages with http.response.status_code &gt; 500 is small, about 17K.</p>
<p>Rather than creating an alert, we can create an SLO with this query:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/5-slo-blog.png" alt="edit SLO" /></p>
<p>We chose to use occurrences as the budgeting method to keep things simple.</p>
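<p>As a rough check on what to expect from this SLO, the occurrence arithmetic can be worked through with the rounded counts quoted above. This is a back-of-the-envelope sketch: the totals are approximations from the text, the 99.9% target is hypothetical (the post does not state the target used), and <code>is_good</code> is an illustrative helper rather than the query from the screenshot.</p>

```python
TOTAL_EVENTS = 2_000_000  # "close to 2M" log messages over seven days (rounded)
BAD_EVENTS = 17_000       # roughly 17K server-error responses (rounded)
SLO_TARGET = 0.999        # hypothetical target, not stated in the post


def is_good(status_code):
    """A response counts as good when it is not a server error (5xx)."""
    return 500 > status_code


good_events = TOTAL_EVENTS - BAD_EVENTS
sli = good_events / TOTAL_EVENTS

# Number of bad events the target tolerates over the same window.
allowed_bad = TOTAL_EVENTS * (1 - SLO_TARGET)
budget_consumed = BAD_EVENTS / allowed_bad

print(f"404 is good: {is_good(404)}, 502 is good: {is_good(502)}")
print(f"SLI: {sli:.2%}")
print(f"Allowed bad events: {allowed_bad:.0f}")
print(f"Error budget consumed: {budget_consumed:.0%}")
```

<p>Even though server errors are under 1% of all messages, the SLI works out to about 99.15%, so a target as strict as the assumed 99.9% would be missed and its error budget overspent several times over.</p>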
<p>Once defined, we can see how well our SLO is performing over a seven-day time frame. We can see not only the SLO, but also the burn rate, the historical SLI, and error budget, and any specific alerts against the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/6-slo-blog.png" alt="SLOs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/7-slo-blog.png" alt="nginx server availability " /></p>
<p>Not only do we get information about the violation, but we also get:</p>
<ul>
<li>
<p>Historical SLI (7 days)</p>
</li>
<li>
<p>Error budget burn down</p>
</li>
<li>
<p>Good vs. bad events (24 hours)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/8-slo-blog.png" alt="Percentages" /></p>
<p>We can see how we’ve easily burned through our error budget.</p>
<p>Hence something must be going on with nginx. To investigate, all we need to do is utilize the <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">AI Assistant</a>, and use its natural language interface to ask questions to help analyze the situation.</p>
<p>Let’s use Elastic’s AI Assistant to analyze the breakdown of http.response.status_code across all the logs from the past seven days. This helps us understand how many 50X errors we are getting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/9-slo-blog.png" alt="count of http response status code" /></p>
<p>As we can see, the number of 502s is minimal compared to the number of overall messages, but it is affecting our SLO.</p>
<p>However, it seems like Nginx is having an issue. In order to reduce the issue, we also ask the AI Assistant how to work on this error. Specifically, we ask if there is an internal runbook the SRE team has created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/10-slo-blog.png" alt="ai assistant thread" /></p>
<p>AI Assistant gets a runbook the team has added to its knowledge base. I can now analyze and try to resolve or reduce the issue with nginx.</p>
<p>While this is a simple example, there are an endless number of possibilities that can be defined based on KQL. Some other simple examples:</p>
<ul>
<li>
<p>99% of requests occur under 200ms</p>
</li>
<li>
<p>99% of log messages are not errors</p>
</li>
</ul>
<h2>Application SLOs: OpenTelemetry demo cartservice</h2>
<p>A common application developers and SREs use to learn about OpenTelemetry and test out Observability features is the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a>.</p>
<p>This demo has <a href="https://opentelemetry.io/docs/demo/feature-flags/">feature flags</a> to simulate issues. With Elastic’s alerting and SLO capability, you can also determine how well the entire application is performing and how well your customer experience is holding up when these feature flags are used.</p>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic supports OpenTelemetry by taking OTLP directly with no need for an Elastic specific agent</a>. You can send in OpenTelemetry data directly from the application (through OTel libraries) and through the collector.</p>
<p>We’ve brought up the OpenTelemetry demo on a K8S cluster (AWS EKS) and turned on the cartservice feature flag. This inserts errors into the cartservice. We’ve also created two SLOs to monitor the cartservice’s availability and latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/11-slo-blog.png" alt="SLOs" /></p>
<p>We can see that the cartservice’s availability is violated. As we drill down, we see that there aren’t as many successful transactions, which is affecting the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/12-slo-blog.png" alt="cartservice-otel" /></p>
<p>As we drill into the service, we can see in Elastic APM that there is a higher than normal failure rate of about 5.5% for the emptyCart service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/13-slo-blog.png" alt="apm" /></p>
<p>We can investigate this further in APM, but that is a discussion for another blog. Stay tuned to see how we can use Elastic’s machine learning, AIOps, and AI Assistant to understand the issue.</p>
<h2>Conclusion</h2>
<p>SLOs allow you to set clear, measurable targets for your service performance, based on factors like availability, response times, error rates, and other key metrics. Hopefully with the overview we’ve provided in this blog, you can see that:</p>
<ul>
<li>
<p>SLOs can be based on logs. In Elastic, you can use KQL to essentially find and filter on specific logs and log fields to monitor and trigger SLOs.</p>
</li>
<li>
<p>AI Assistant is a valuable, easy-to-use capability to analyze, troubleshoot, and even potentially resolve SLO issues.</p>
</li>
<li>
<p>APM Service based SLOs are easy to create and manage with integration to Elastic APM. We also use OTel telemetry to help monitor SLOs.</p>
</li>
</ul>
<p>For more information on SLOs in Elastic, check out <a href="https://www.elastic.co/guide/en/observability/current/slo.html">Elastic documentation</a> and the following resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">What’s new in Elastic Observability 8.12</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Introducing the Elastic AI Assistant</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic OpenTelemetry support</a></p>
</li>
</ul>
<p>Ready to get started? Sign up for <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your SLOs.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/139686_-_Elastic_-_Headers_-_V1_3.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Simplifying log data management: Harness the power of flexible routing with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/simplifying-log-data-management-flexible-routing</link>
            <guid isPermaLink="false">simplifying-log-data-management-flexible-routing</guid>
            <pubDate>Tue, 13 Jun 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[The reroute processor, available as of Elasticsearch 8.8, allows customizable rules for routing documents, such as logs, into data streams for better control of processing, retention, and permissions with examples that you can try on your own.]]></description>
            <content:encoded><![CDATA[<p>In Elasticsearch 8.8, we’re introducing the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">reroute processor</a> in technical preview that makes it possible to send documents, such as logs, to different <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html">data streams</a>, according to flexible routing rules. When using Elastic Observability, this gives you more granular control over your data with regard to retention, permissions, and processing with all the potential benefits of the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>. While optimized for data streams, the reroute processor also works with classic indices. This blog post contains examples on how to use the reroute processor that you can try on your own by executing the snippets in the <a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Kibana dev tools</a>.</p>
<p>Elastic Observability offers a wide range of <a href="https://www.elastic.co/integrations/data-integrations?solution=observability">integrations</a> that help you to monitor your applications and infrastructure. These integrations are added as policies to <a href="https://www.elastic.co/guide/en/fleet/current/elastic-agent-installation.html">Elastic agents</a>, which help ingest telemetry into Elastic Observability. Several examples of these integrations include the ability to ingest logs from systems that send a stream of logs from different applications, such as <a href="https://www.elastic.co/guide/en/kinesis/current/aws-firehose-setup-guide.html">Amazon Kinesis Data Firehose</a>, <a href="https://docs.elastic.co/en/integrations/kubernetes">Kubernetes container logs</a>, and <a href="https://docs.elastic.co/integrations/tcp">syslog</a>. One challenge is that these multiplexed log streams are sending data to the same Elasticsearch data stream, such as logs-syslog-default. This makes it difficult to create parsing rules in ingest pipelines and dashboards for specific technologies, such as the ones from the <a href="https://docs.elastic.co/en/integrations/nginx">Nginx</a> and <a href="https://docs.elastic.co/en/integrations/apache">Apache</a> integrations. That’s because in Elasticsearch, in combination with the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a>, the processing and the schema are both encapsulated in a data stream.</p>
<p>The reroute processor helps you tease apart data from a generic data stream and send it to a more specific one. You may use that mechanism to send logs to a data stream that is set up by the Nginx integration, for example, so that the logs are parsed with that integration and you can use the integration’s prebuilt dashboards or create custom ones with the fields, such as the url, the status code, and the response time that the Nginx pipeline has parsed out of the Nginx log message. You can also split out/separate regular Nginx logs and errors with the reroute processor, providing further separation ability and categorization of logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/simplifying-log-data-management-flexible-routing/blog-elastic-routing-pipeline.png" alt="routing pipeline" /></p>
<h2>Example use case</h2>
<p>To use the reroute processor, first:</p>
<ol>
<li>
<p>Ensure you are on Elasticsearch 8.8</p>
</li>
<li>
<p>Ensure you have permissions to manage indices and data streams</p>
</li>
<li>
<p>If you don’t already have an account on <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a>, sign up for one</p>
</li>
</ol>
<p>Next, you’ll need to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/set-up-a-data-stream.html">set up a data stream</a> and create a custom Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html">ingest pipeline</a> that is called as the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html#set-default-pipeline">default pipeline</a>. Below we go through this step by step for the “mydata” data set that we’ll simulate ingesting container logs into. We start with a basic example and extend it from there.</p>
<p>Run the following steps in the Kibana Console, which is found at <strong>Management -&gt; Dev tools -&gt; Console</strong>. First, we need an ingest pipeline and a template for the data stream:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
      }
    }
  ]
}
</code></pre>
<p>This creates an ingest pipeline with an empty reroute processor. To make use of it, we need an index template:</p>
<pre><code class="language-bash">PUT _index_template/logs-mydata
{
  &quot;index_patterns&quot;: [
    &quot;logs-mydata-*&quot;
  ],
  &quot;data_stream&quot;: {},
  &quot;priority&quot;: 200,
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;logs-mydata&quot;
    },
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;container.name&quot;: {
          &quot;type&quot;: &quot;keyword&quot;
        }
      }
    }
  }
}
</code></pre>
<p>The above template is applied to all data that is shipped to logs-mydata-*. We have mapped container.name as a keyword, as this is the field we will be using for routing later on. Now, we send a document to the data stream and it will be ingested into logs-mydata-default:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  }
}
</code></pre>
<p>We can check that it was ingested with the command below, which will show 1 result.</p>
<pre><code class="language-bash">GET logs-mydata-default/_search
</code></pre>
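<p>To confirm that the index template was picked up and a data stream was created for the document, you can also list the matching data streams. This is a small optional check that reuses the logs-mydata-* pattern from the template above:</p>
<pre><code class="language-bash">GET _data_stream/logs-mydata-*
</code></pre>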
<p>Even without further configuration, the empty reroute processor already allows us to route documents. By default, it looks for the data_stream.dataset and data_stream.namespace fields and sends documents to the corresponding data stream, according to the <a href="https://www.elastic.co/blog/an-introduction-to-the-elastic-data-stream-naming-scheme">data stream naming scheme</a> logs-&lt;dataset&gt;-&lt;namespace&gt;. Let’s try this out:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-03-30T12:27:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  },
  &quot;data_stream&quot;: {
    &quot;dataset&quot;: &quot;myotherdata&quot;
  }
}
</code></pre>
<p>As can be seen with the GET logs-myotherdata-default/_search command, this document ended up in the logs-myotherdata-default data stream. But instead of using default rules, we want to create our own rules for the field container.name. If container.name is foo, we want to send the document to logs-foo-default. For this we modify our routing pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;tag&quot;: &quot;foo&quot;,
        &quot;if&quot; : &quot;ctx.container?.name == 'foo'&quot;,
        &quot;dataset&quot;: &quot;foo&quot;
      }
    }
  ]
}
</code></pre>
<p>Let's test this with a document:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo&quot;
  }
}
</code></pre>
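<p>To check where the document went, one option is to search the target data stream. Another is to dry-run the pipeline with the simulate API, sketched below. The simulated document sets _index explicitly, on the assumption that the reroute processor derives the current type, dataset, and namespace from the index name. The _index value in the response should then show logs-foo-default.</p>
<pre><code class="language-bash">GET logs-foo-default/_search

POST _ingest/pipeline/logs-mydata/_simulate
{
  &quot;docs&quot;: [
    {
      &quot;_index&quot;: &quot;logs-mydata-default&quot;,
      &quot;_source&quot;: {
        &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
        &quot;container&quot;: {
          &quot;name&quot;: &quot;foo&quot;
        }
      }
    }
  ]
}
</code></pre>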
<p>While it would be possible to specify a routing rule for each container name, you can also route by the value of a field in the document:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;reroute&quot;: {
        &quot;tag&quot;: &quot;mydata&quot;,
        &quot;dataset&quot;: [
          &quot;{{container.name}}&quot;,
          &quot;mydata&quot;
        ]
      }
    }
  ]
}
</code></pre>
<p>In this example, we are using a field reference as a routing rule. If the container.name field exists in the document, the document is routed to the data stream named after its value; otherwise, the dataset falls back to mydata. This can be tested with:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo1&quot;
  }
}

POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;container&quot;: {
    &quot;name&quot;: &quot;foo2&quot;
  }
}
</code></pre>
<p>This creates the data streams logs-foo1-default and logs-foo2-default.</p>
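<p>The fallback can be exercised in the same way. As a small sketch, a document without a container.name field should keep the dataset mydata and therefore stay in logs-mydata-default:</p>
<pre><code class="language-bash">POST logs-mydata-default/_doc
{
  &quot;@timestamp&quot;: &quot;2023-05-25T12:26:23+00:00&quot;,
  &quot;message&quot;: &quot;no container name set&quot;
}

GET logs-mydata-default/_search
</code></pre>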
<p><em>NOTE: There is currently a limitation in the processor that requires the fields specified in a <code>{{field.reference}}</code> to be in a nested object notation. A dotted field name does not currently work. Also, you’ll get errors when the document contains dotted field names for any</em> <em>data_stream.*</em> <em>field. This limitation will be</em> <a href="https://github.com/elastic/elasticsearch/pull/96243"><em>fixed</em></a> <em>in 8.8.2 and 8.9.0.</em></p>
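<p>On affected versions, one possible workaround (a sketch we have not verified against every release, and not an officially documented fix) is to expand the dotted field into an object with the dot_expander processor before the reroute processor runs:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/logs-mydata
{
  &quot;description&quot;: &quot;Routing for mydata&quot;,
  &quot;processors&quot;: [
    {
      &quot;dot_expander&quot;: {
        &quot;field&quot;: &quot;container.name&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;tag&quot;: &quot;mydata&quot;,
        &quot;dataset&quot;: [
          &quot;{{container.name}}&quot;,
          &quot;mydata&quot;
        ]
      }
    }
  ]
}
</code></pre>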
<h2>API keys</h2>
<p>When using the reroute processor, it is important that the API keys specified have permissions for the source and target indices. For example, if a pattern is used for routing from logs-mydata-default, the API key must have write permissions for <code>logs-*-*</code> as data could end up in any of these indices (see example further down).</p>
<p>We’re currently <a href="https://github.com/elastic/integrations/issues/5989">working</a> <a href="https://github.com/elastic/integrations/issues/6255">on</a> extending the API key permissions for our <a href="https://www.elastic.co/integrations/data-integrations">integrations</a> so that they allow for routing by default if you’re running a Fleet-managed Elastic Agent.</p>
<p>If you’re using a standalone Elastic Agent, or any other shipper, you can use this as a template to create your API key:</p>
<pre><code class="language-bash">POST /_security/api_key
{
  &quot;name&quot;: &quot;ingest_logs&quot;,
  &quot;role_descriptors&quot;: {
    &quot;ingest_logs&quot;: {
      &quot;cluster&quot;: [
        &quot;monitor&quot;
      ],
      &quot;indices&quot;: [
        {
          &quot;names&quot;: [
            &quot;logs-*-*&quot;
          ],
          &quot;privileges&quot;: [
            &quot;auto_configure&quot;,
            &quot;create_doc&quot;
          ]
        }
      ]
    }
  }
}
</code></pre>
<h2>Future plans</h2>
<p>In Elasticsearch 8.8, the reroute processor was released in technical preview. The plan is to adopt this in our data sink integrations like syslog, k8s, and others. Elastic will provide default routing rules that just work out of the box, but it will also be possible for users to add their own rules. If you are using our integrations, follow <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html#pipelines-for-fleet-elastic-agent">this guide</a> on how to add a custom ingest pipeline.</p>
<h2>Try it out!</h2>
<p>This blog post has shown some sample use cases for document based routing. Try it out on your data by adjusting the commands for index templates and ingest pipelines to your own data, and get started with <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> through a 7-day free trial. Let us know via <a href="https://ela.st/reroute-feedback">this feedback form</a> how you’re planning to use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">reroute processor</a> and whether you have suggestions for improvement.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/simplifying-log-data-management-flexible-routing/observability-digital-transformation-1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How Streams in Elastic Observability Simplifies Retention Management]]></title>
            <link>https://www.elastic.co/observability-labs/blog/simplifying-retention-management-with-streams</link>
            <guid isPermaLink="false">simplifying-retention-management-with-streams</guid>
            <pubDate>Thu, 30 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Streams simplifies retention management in Elasticsearch with a unified view to monitor, visualize, and control data lifecycles using DSL or ILM.]]></description>
            <content:encoded><![CDATA[<p>Managing retention in Elasticsearch can get complicated fast. Between <a href="https://www.elastic.co/docs/manage-data/lifecycle/data-stream">Data stream lifecycle (DSL)</a>, <a href="https://www.elastic.co/docs/manage-data/lifecycle/index-lifecycle-management">Index lifecycle management (ILM)</a>, templates, and individual index settings, keeping policies consistent across data streams often takes more effort than it should.</p>
<p><strong>Streams</strong> changes that. It introduces a clear, unified way to manage how long your data lives, whether you’re using DSL or ILM. You can visualize ingestion, understand where data sits across tiers, and adjust retention with confidence, applying updates to a single stream without worrying about unintended changes elsewhere, all from a single view.</p>
<h3>Walkthrough: Exploring the Retention Tab</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/retention_view.png" alt="Retention view of a stream" /></p>
<p>Retention management lives in the <strong>Retention</strong> tab of each stream. This is your control panel for understanding how much data you’re storing, how quickly it’s growing, and how your lifecycle policies are applied. It’s also where you can monitor and configure the <a href="https://www.elastic.co/docs/manage-data/data-store/data-streams/failure-store">Failure store</a>, which tracks and retains documents that failed to be ingested.</p>
<h4>Metrics at a glance</h4>
<p>At the top of the view, you’ll find an overview of key metrics:</p>
<ul>
<li>Storage size: the total data volume currently held by the stream.</li>
<li>Ingestion averages: calculated from the selected time range, Streams extrapolates both daily and monthly averages to give you a sense of long-term trends.</li>
</ul>
<p>This combination of near-real-time and projected values helps you quickly spot when ingestion is ramping up and whether your retention policy aligns with it.</p>
<h4>Ingestion over time</h4>
<p>Below the metrics, a graph shows ingestion volume over time. This information is approximated based on the number of documents over time, multiplied by the average document size in the backing index. </p>
<h4>Visualizing lifecycle phases</h4>
<p>When an ILM policy is effective, the retention view becomes more visual. Streams displays a phase breakdown (hot, warm, cold, frozen) showing the data volume stored in each phase. This gives you a clear sense of how your data is distributed across the storage tiers and whether your lifecycle is doing what you expect.</p>
<h4>Failure store</h4>
<p>A failure store is a secondary set of indices inside a data stream, dedicated to storing documents that failed to be ingested. Within the Retention tab, you can toggle the Failure store on or off, and configure its own retention period.
We’ll cover Failure store and Data quality in more detail in <a href="https://www.elastic.co/observability-labs/blog/data-quality-and-failure-store-in-streams">this article</a>.</p>
<h3>Updating Retention</h3>
<p>Beyond visualizing your retention, Streams makes it easy to change how it’s managed.</p>
<h4>Switching between DSL and ILM</h4>
<p>You can freely switch a stream between DSL and ILM management, or update a DSL retention period, with just a few clicks. Streams takes care of updating the lifecycle settings at the data stream level, ensuring consistent retention across all existing backing indices, not just new ones.</p>
<p>Whether you prefer the simplicity of DSL or the fine-grained tiering of ILM, you can move between the two seamlessly.</p>
<p><em>Clicking “Edit data retention” opens a modal that allows you to update the stream’s configuration. From there you can update the ILM policy or set a custom retention period via DSL.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/edit_ilm.png" alt="Modal view to set a lifecycle policy" /></p>
<p><em>You can set a custom period, or pick an Indefinite retention for your data.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/edit_dsl.png" alt="Modal view to set a custom retention period" /></p>
<p><em>You can also update streams’ lifecycle via the <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-put-streams-name">Upsert stream</a> or the <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-put-streams-name-ingest">Update ingest stream settings</a> Kibana APIs.</em></p>
<h4>Inherit or defer: different strategies for different stream types</h4>
<p><strong>Classic streams</strong></p>
<p>For classic streams, you can default to the existing index template’s retention. Retention isn’t managed by Streams in this case; it follows the lifecycle configuration defined in the template just as it normally would.</p>
<p>This option is useful if you’re onboarding existing data streams and want to keep their lifecycle behavior intact while still benefiting from Streams’ visibility and monitoring features.</p>
<p><strong>Wired streams</strong></p>
<p>Wired streams live in a tree structure, and that hierarchy allows an inheritance model.</p>
<p>A child stream can inherit the lifecycle of its nearest ancestor that has a concrete policy (ILM or DSL). This keeps your configuration lean and consistent since you can set a single lifecycle at a higher level in the tree and let Streams automatically apply it to all relevant descendants.</p>
<p>If that ancestor’s lifecycle is later updated, Streams cascades the change down to all children that inherit it, so everything stays in sync.</p>
<p><em>In the figure below, we set a different retention for</em> <strong><em>logs.prod</em></strong> <em>and</em> <strong><em>logs.staging</em></strong> <em>environments. The child partitions of these environments automatically inherit the configuration.</em>
<img src="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/streams_tree.png" alt="A streams tree that shows inheritance" /></p>
<h4>How it works under the hood</h4>
<p>When you apply or update a lifecycle, <strong>Streams</strong> calls Elasticsearch’s <a href="https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-settings">/_data_stream/_settings</a>. This is a new API we’ve added in 8.19 / 9.1 for this purpose. </p>
<p>This API is key to keeping retention consistent:</p>
<ol>
<li>It applies the lifecycle directly at the data stream level, overriding any configuration from cluster settings or index templates.</li>
<li>It propagates the retention update to all existing backing indices, not just new ones, so retention remains uniform across your historical and future data.</li>
</ol>
<p>By centralizing lifecycle management at the data stream level and applying a consistent configuration across the backing indices, we remove the ambiguity that used to exist between template-level and index-level configurations. You always know which retention policy is actually in effect, and you can see it directly in the UI.</p>
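<p>To illustrate what data-stream-level lifecycle updates look like when issued by hand, here is a rough sketch. It is not the exact set of requests Streams sends. The data stream name logs-mydata-default and the policy name my-logs-policy are hypothetical placeholders. The first request assigns an ILM policy through the data stream settings API, and the second sets a DSL retention period through the data stream lifecycle API:</p>
<pre><code class="language-bash">PUT _data_stream/logs-mydata-default/_settings
{
  &quot;index.lifecycle.name&quot;: &quot;my-logs-policy&quot;
}

PUT _data_stream/logs-mydata-default/_lifecycle
{
  &quot;data_retention&quot;: &quot;30d&quot;
}
</code></pre>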
<h3>Wrapping Up</h3>
<p>With Streams, retention management becomes clear and consistent. You can visualize ingestion, switch between DSL and ILM, or inherit policies across streams, all without diving into templates or manual index settings.</p>
<p>By unifying retention into a single view, Streams turns lifecycle management into something simple, predictable, and transparent.</p>
<p>Sign up for an Elastic trial at <a href="http://cloud.elastic.co">cloud.elastic.co</a>, and try Elastic's Serverless offering, which lets you explore all of the Streams functionality.</p>
<p>Additionally, check out:</p>
<p><em>Read about</em> <a href="https://www.elastic.co/observability-labs/blog/reimagine-observability-elastic-streams"><em>Reimagining streams</em></a></p>
<p><em>Look at the</em> <a href="http://elastic.co/elasticsearch/streams"><em>Streams website</em></a></p>
<p><em>Read the</em> <a href="https://www.elastic.co/docs/solutions/observability/streams/streams"><em>Streams documentation</em></a></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/simplifying-retention-management-with-streams/article.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Smarter log analytics in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/smarter-log-analytics-in-elastic-observability</link>
            <guid isPermaLink="false">smarter-log-analytics-in-elastic-observability</guid>
            <pubDate>Mon, 10 Jun 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover smarter log handling with Kibana's latest features! The new Data Source Selector lets you easily filter logs by integrations like System Logs and Nginx. Smart Fields enhance log analysis by presenting data more intuitively. Simplify your workflow and uncover deeper insights today!]]></description>
            <content:encoded><![CDATA[<p>Discover a smarter way to handle your logs with Kibana's latest features! Our new Data Source selector makes it effortless to zero in on the logs you need, whether they're from System Logs or Application Logs by selecting your integrations or data views. Plus, with the introduction of Smart Fields, your log analysis is now more intuitive and insightful. Get ready to simplify your workflow and uncover deeper insights with these game-changing updates. Dive in and see how easy log exploration can be!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/smarter-log-analytics-in-elastic-observability/smart-fields.png" alt="Smart fields" /></p>
<h2>Find the logs you’re looking for</h2>
<h3>Focus on logs from specific integrations or data views</h3>
<p>We've added the Data Source selector, a handy new feature for viewing specific logs. Now, you can easily filter your logs based on your integrations, like System Logs, Nginx, or Elastic APM, or switch between different data views, like logs or metrics. This new selector is all about making your data easier to find and helping you focus on what matters most in your analysis.</p>
<h2>Dive into your logs</h2>
<h3>Analyze logs with Smart Fields in Kibana</h3>
<p>Logs in Kibana have undergone a significant transformation, particularly in the way log data is presented. The once-basic table view has evolved with the introduction of Smart Fields, providing users with a more insightful and dynamic log analysis experience.</p>
<h4>Resource Smart Field - centralizing log source information</h4>
<p>The resource column further elevates the Logs Explorer page by providing users with a single column for exploring the resource that created the log event. This column groups various resource-indicating fields together, streamlining the investigation process. Currently, the following <a href="https://www.elastic.co/guide/en/ecs/current/ecs-reference.html">ECS</a> fields are grouped under this single column and we recommend including them in your logs:</p>
<ul>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-service.html#field-service-name">service.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-container.html#field-container-name">container.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-orchestrator.html#field-orchestrator-namespace">orchestrator.namespace</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-host.html#field-host-name">host.name</a></li>
<li><a href="https://www.elastic.co/guide/en/ecs/current/ecs-cloud.html#field-cloud-instance-id">cloud.instance.id</a></li>
</ul>
<p>We know this does not cover all use cases, and we would like your feedback on other fields that you use or that are important to you, to help us provide a tailored and user-centric log analysis experience.</p>
<h4>Content Smart Field - a deeper dive into log data</h4>
<p>The content column revolutionizes log analysis by seamlessly rendering <strong>log.level</strong> and <strong>message</strong> fields. Notably, it automatically handles fallbacks, ensuring a smooth transition when the actual message field is not available. This enhancement simplifies the log exploration process, offering users a more comprehensive understanding of their data.</p>
<h4>Actions column - unleashing additional columns</h4>
<p>As part of our commitment to empowering users, we are introducing the actions column, adding a layer of functionality to the document table. This column includes two powerful actions:</p>
<ul>
<li><strong>Degraded document indicator</strong>: This indicator provides insights about the quality of your data by showing that fields were ignored when the document was indexed and ended up in the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ignored-field.html">_ignored</a> property of the document. To help analyze what caused the document to degrade, we suggest reading this blog: <a href="https://www.elastic.co/observability-labs/blog/antidote-index-mapping-exceptions-ignore-malformed">The antidote for index mapping exceptions: ignore_malformed</a>.</li>
<li><strong>Stacktrace indicator</strong>: This indicator informs users of the presence of stack traces in the document. This makes it easy to navigate through log documents and know whether they contain additional information.</li>
</ul>
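<p>Outside of the UI, degraded documents can also be found with a query. As a minimal sketch, the request below uses an exists query on the _ignored field; the logs-* index pattern is an assumption and should be narrowed to your own data streams:</p>
<pre><code class="language-bash">GET logs-*/_search
{
  &quot;query&quot;: {
    &quot;exists&quot;: {
      &quot;field&quot;: &quot;_ignored&quot;
    }
  }
}
</code></pre>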
<h3>Investigate individual logs by expanding log details</h3>
<p>Now, when you click the expand icon in the actions column, it opens up the <strong>Log details</strong> flyout for any log entry. This new feature gives you a detailed overview of the entry right at your fingertips. Inside the flyout, the <strong>Overview</strong> tab is neatly organized into four sections—Content breakdown, Service &amp; Infrastructure, Cloud, and Others—each offering a snapshot of the most crucial information. Plus, you'll find the same handy controls you're used to in the main table, like filtering in or out, adding or removing columns, and copying data, making it easier than ever to manage your logs directly from the flyout.</p>
<p>The <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability AI Assistant</a> is fully integrated into this view providing contextual insights about the log event and helping to find similar messages.</p>
<h2>Experience a streamlined approach to log exploration</h2>
<p>These enhancements simplify the process of finding and focusing on specific logs and offer more intuitive and insightful data presentation. Dive into your logs with these tools and streamline your workflow, uncovering deeper insights with ease. Try it now and transform your log analysis!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/smarter-log-analytics-in-elastic-observability/log-monitoring.jpeg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[SNMP Topology Data in Kibana: Collection to Canvas]]></title>
            <link>https://www.elastic.co/observability-labs/blog/snmp-topology-data-kibana-collection-canvas</link>
            <guid isPermaLink="false">snmp-topology-data-kibana-collection-canvas</guid>
            <pubDate>Wed, 03 Jun 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[The Network Topology plugin for Kibana provides a ready-to-deploy Logstash pipeline, a structured schema, and a topology view that shows what's connected to what.]]></description>
            <content:encoded><![CDATA[<h2>SNMP collection shouldn't require a side quest.</h2>
<p>Getting SNMP data into Elasticsearch unlocks rich visibility into your network — interface utilization, routing health, L2 forwarding, and more. The path there involves a few familiar steps: choosing which MIBs to walk, mapping OIDs to human-readable field names, configuring SNMP v2c or v3 authentication, accommodating vendor-specific MIB extensions, and tuning the pipeline to gracefully handle device timeouts across large inventories. With a solid template in place, what used to be a bespoke Logstash project becomes a repeatable, shareable setup that any engineer on the team can pick up and extend.</p>
<p><a href="https://github.com/elastic/kibana-network-topology-plugin">The plugin</a> includes a Logstash pipeline <a href="https://github.com/elastic/kibana-network-topology-plugin/blob/main/docs/collectors/logstash.conf">template</a> that handles the common cases out of the box. It walks IF-MIB (interface counters and status), IP-MIB (ARP tables and IP address assignments), BRIDGE-MIB (MAC address forwarding tables), BGP4-MIB (BGP peer sessions), and OSPF-MIB (OSPF neighbor adjacencies) per target device on a configurable poll interval. You add your device list and authentication details, start Logstash, and data begins flowing into Elasticsearch.</p>
<p>The template also handles the operational annoyances that trip people up: poll timeouts, missing OID branches on devices that don't support a given MIB, and batching walks across large device inventories.</p>
<h2>Structuring SNMP data in Elasticsearch: schema design</h2>
<p>Once SNMP data lands in Elasticsearch, the next problem is structure. Interface counters like <code>ifInOctets</code> and <code>ifOperStatus</code> map to ECS concepts reasonably well. They're host-level metrics with direct equivalents in <code>host.network.*</code> fields. But the data network engineers actually need for troubleshooting is relational, and this plugin offers a way to view those relationships.</p>
<p>A BGP peer session has a state, a remote AS number, an uptime, and an update count. An OSPF adjacency has a neighbor router ID, an area, a priority, and a state machine position. A MAC table entry records which physical switch port a given MAC address was learned on. None of these have ECS field definitions, and stuffing them into generic <code>event.*</code> or <code>observer.*</code> fields loses the semantic meaning that makes the data useful.</p>
<p>The plugin takes an opinionated approach: use ECS where it fits, extend with clear namespaces where it doesn't. Interface data maps to ECS-aligned fields. Routing protocol and L2 forwarding data goes into purpose-built namespaces (<code>bgp_peer.*</code>, <code>ospf_neighbor.*</code>, <code>arp.*</code>, <code>mac_table.*</code>) with field names that match the concepts operators already think in. If you know what <code>bgpPeerState</code> means on a router CLI, <code>bgp_peer.state</code> in Elasticsearch is immediately familiar. If you already collect SNMP data in a homegrown schema, the plugin's templates and ingest pipeline will complement it rather than replace it. The new fields are additive, so you can adopt them at your own pace!</p>
<table>
<thead>
<tr>
<th>Data Type</th>
<th>Key Fields</th>
<th>Namespace</th>
</tr>
</thead>
<tbody>
<tr>
<td>BGP Peer Session</td>
<td>State, Remote AS, Uptime, Update Count</td>
<td><code>bgp_peer.*</code></td>
</tr>
<tr>
<td>OSPF Adjacency</td>
<td>Neighbor Router ID, Area, Priority, State</td>
<td><code>ospf_neighbor.*</code></td>
</tr>
<tr>
<td>MAC Table Entry</td>
<td>Switch Port, Learned MAC Address</td>
<td><code>mac_table.*</code></td>
</tr>
<tr>
<td>ARP Entry</td>
<td>IP-to-MAC Mapping</td>
<td><code>arp_table.*</code></td>
</tr>
</tbody>
</table>
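<p>To make the namespace idea concrete, here is a minimal Python sketch of how a raw BGP4-MIB row could be turned into a <code>bgp_peer.*</code> document. The integer-to-name mapping follows the <code>bgpPeerState</code> enumeration in BGP4-MIB; the field names other than <code>bgp_peer.state</code>, and the input row itself, are illustrative assumptions rather than the plugin's actual schema or code.</p>
<pre><code class="language-python"># BGP4-MIB encodes bgpPeerState as an integer enumeration.
BGP_PEER_STATES = {
    1: 'idle',
    2: 'connect',
    3: 'active',
    4: 'opensent',
    5: 'openconfirm',
    6: 'established',
}


def to_bgp_peer_doc(raw):
    """Turn a raw SNMP row (hypothetical input) into a namespaced document."""
    return {
        'bgp_peer': {
            # bgp_peer.state is named in the article; the other field
            # names here are assumptions made for illustration.
            'state': BGP_PEER_STATES.get(raw['bgpPeerState'], 'unknown'),
            'remote_as': raw['bgpPeerRemoteAs'],
            'remote_address': raw['bgpPeerRemoteAddr'],
        }
    }


row = {'bgpPeerState': 6, 'bgpPeerRemoteAs': 65001, 'bgpPeerRemoteAddr': '192.0.2.1'}
doc = to_bgp_peer_doc(row)
print(doc['bgp_peer']['state'])  # prints: established
</code></pre>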
<p>An ingest pipeline (<code>snmp-device-enrichment</code>) handles classification at index time. It parses each device's <code>sysDescr</code> string to assign a normalized <code>device.type</code> (router, switch, firewall, access point) and <code>device.vendor</code>, so every downstream consumer (dashboards, ES|QL queries, alerting rules, the topology view) gets consistent device metadata without custom parsing. The pipeline recognizes common vendors out of the box and is extensible for environments with less common hardware.</p>
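<p>The classification step can be pictured with a short Python sketch. This is not the <code>snmp-device-enrichment</code> pipeline itself, which runs inside Elasticsearch; it only illustrates the idea of keyword matching on <code>sysDescr</code>. The keyword tables and the sample string are assumptions.</p>
<pre><code class="language-python"># Keyword tables are illustrative assumptions, not the plugin's real rules.
VENDOR_KEYWORDS = ['cisco', 'juniper', 'arista']
# Order matters: more specific device types are checked first.
TYPE_KEYWORDS = ['firewall', 'access point', 'switch', 'router']


def classify(sys_descr):
    """Guess device.vendor and device.type from a sysDescr string."""
    text = sys_descr.lower()
    vendor = next((k for k in VENDOR_KEYWORDS if k in text), 'unknown')
    dev_type = next((k for k in TYPE_KEYWORDS if k in text), 'unknown')
    return {'device': {'vendor': vendor, 'type': dev_type}}


result = classify('Cisco IOS Software, Catalyst L3 Switch Software')
print(result)
</code></pre>
<p>Anything that matches no keyword falls back to <code>unknown</code>, which is one simple way to keep less common hardware visible until a rule is added for it.</p>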
<p>The result is SNMP data you can query like any other structured data in Elasticsearch. &quot;Show me every BGP session not in Established state&quot; is a filter, not a regex exercise. &quot;Which Cisco switches have interfaces that are admin-up but oper-down&quot; is a KQL query, not a script.</p>
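<p>As a rough illustration of the first example, the request body for "every BGP session not in Established state" could look like the following Elasticsearch query DSL, built here as a Python dictionary. The field name <code>bgp_peer.state</code> comes from the schema described above; the stored value <code>established</code> is an assumption about how the state is indexed.</p>
<pre><code class="language-python">import json

# bgp_peer.state is named in the article; the lowercase keyword value
# 'established' is an assumption about the index mapping.
query = {
    'query': {
        'bool': {
            # Only consider documents that actually carry a BGP peer state.
            'filter': [{'exists': {'field': 'bgp_peer.state'}}],
            # Exclude sessions that are up.
            'must_not': [{'term': {'bgp_peer.state': 'established'}}],
        }
    }
}

print(json.dumps(query, indent=2))
</code></pre>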
<h2>Visualising SNMP network topology in Kibana</h2>
<p>Dashboards excel at answering &quot;what are the numbers?&quot; A topological view answers a complementary question: &quot;what's connected to what?&quot; Network engineers think in topology: upstream and downstream relationships, path diversity, and blast radius of a link failure. A spatial, graph-based view brings that mental model directly into Kibana, sitting alongside the charts and data tables operators already rely on.</p>
<p>The plugin adds an interactive topology graph to Kibana's Observability navigation. It reads the structured SNMP data from Elasticsearch, builds an adjacency graph from ARP, MAC table, BGP, and OSPF relationships, and renders it as a force-directed layout you can zoom, pan, and rearrange. Nodes are devices, edges are discovered relationships, and clicking any device opens a flyout with its interface table, ARP neighbors, and routing protocol sessions.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/snmp-topology-data-kibana-collection-canvas/topo-diagram.png" alt="Network Topology Diagram" /></p>
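<p>The graph-building step can be sketched in a few lines of Python. The relationship records below are made up, and the plugin's real implementation is not shown in this post; the sketch only illustrates how discovered relationships become an adjacency map in which nodes are devices and edges are relationships.</p>
<pre><code class="language-python">from collections import defaultdict

# Hypothetical relationship records; in practice these would be derived
# from ARP, MAC table, BGP, and OSPF documents in Elasticsearch.
relationships = [
    {'source': 'core-rtr-1', 'target': 'core-rtr-2', 'kind': 'bgp'},
    {'source': 'core-rtr-1', 'target': 'dist-sw-1', 'kind': 'ospf'},
    {'source': 'dist-sw-1', 'target': 'access-sw-1', 'kind': 'mac_table'},
    {'source': 'core-rtr-2', 'target': 'core-rtr-1', 'kind': 'bgp'},
]


def build_adjacency(records):
    """Build an undirected adjacency map: device name to set of neighbors."""
    graph = defaultdict(set)
    for rec in records:
        graph[rec['source']].add(rec['target'])
        graph[rec['target']].add(rec['source'])
    return graph


graph = build_adjacency(relationships)
for device in sorted(graph):
    print(device, sorted(graph[device]))
</code></pre>
<p>Using sets means a relationship reported from both ends, such as the BGP session above, collapses into a single edge.</p>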
<h2>How do you set up SNMP network topology monitoring in Kibana?</h2>
<p><img src="https://www.elastic.co/observability-labs/assets/images/snmp-topology-data-kibana-collection-canvas/setup-tab.png" alt="Setup Overview" /></p>
<p>The plugin is nearly ready to go out of the box; <a href="https://www.elastic.co/docs/solutions/observability/infra-and-hosts/network-topology/monitor-network-devices">only a few assets need installation</a>. Here's what the setup looks like:</p>
<ol>
<li>
<p><strong>Install the plugin zip</strong> on a self-managed Kibana instance (<code>bin/kibana-plugin install file:///path/to/networkTopology-&lt;version&gt;.zip</code>).</p>
</li>
<li>
<p><strong>Apply the index templates and ingest pipeline</strong>. Click through the template installation in the plugin's Setup tab. A few button clicks and the schema is in place.</p>
</li>
<li>
<p><strong>Deploy the Logstash pipeline.</strong> Add your device targets, authentication details, or other configuration to the included template and start it. If you're using <a href="https://www.elastic.co/docs/reference/logstash/logstash-centralized-pipeline-management">Logstash Centralized Pipeline Management</a>, push it from Kibana, no SSH required.</p>
</li>
</ol>
<p>Data hits Elasticsearch on the next poll cycle, the ingest pipeline classifies and enriches it, and the topology view populates. Start to finish, you're looking at minutes, not hours or days of trial and error.</p>
<p>A <a href="https://github.com/elastic/kibana-network-topology-plugin/blob/main/scripts/generate_sample_data.mjs">sample data generator</a> is included for teams that want to evaluate the plugin before connecting to live infrastructure; spin up a <a href="https://www.elastic.co/docs/deploy-manage/deploy/self-managed/install-elasticsearch-docker-basic">Docker development environment</a> and explore the full feature set with a realistic multi-site network.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/snmp-topology-data-kibana-collection-canvas/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Easily analyze AWS VPC Flow Logs with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/vpc-flow-logs-monitoring-analytics-observability</link>
            <guid isPermaLink="false">vpc-flow-logs-monitoring-analytics-observability</guid>
            <pubDate>Mon, 23 Jan 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Observability can ingest and help analyze AWS VPC Flow Logs from your application’s VPC. Learn how to ingest AWS VPC Flow Logs through a step-by-step method into Elastic, then analyze it and apply OOTB machine learning for insights.]]></description>
            <content:encoded><![CDATA[<p>Elastic Observability provides a full-stack observability solution, by supporting metrics, traces, and logs for applications and infrastructure. In <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">a previous blog</a>, I showed you an <a href="https://www.elastic.co/observability/aws-monitoring">AWS monitoring</a> infrastructure running a three-tier application. Specifically we reviewed metrics ingest and analysis on Elastic Observability for EC2, VPC, ELB, and RDS. In this blog, we will cover how to ingest logs from AWS, and more specifically, we will review how to get VPC Flow Logs into Elastic and what you can do with this data.</p>
<p>Logging is an important part of observability, even though we generally think of metrics and/or tracing first. However, the volume of logs that an application or its underlying infrastructure outputs can be daunting.</p>
<p>With Elastic Observability, there are three main mechanisms to ingest logs:</p>
<ul>
<li>The new Elastic Agent pulls metrics and logs from CloudWatch and S3, where logs are generally pushed from a service (for example, EC2, ELB, WAF, Route53, etc.). We reviewed Elastic Agent metrics configuration for EC2, RDS (Aurora), ELB, and NAT metrics in this <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">blog</a>.</li>
<li>Using <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Elastic’s Serverless Forwarder (runs on Lambda and available in AWS SAR)</a> to send logs from Firehose, S3, CloudWatch, and other AWS services into Elastic.</li>
<li>Beta feature (contact your Elastic account team): Using AWS Firehose to directly insert logs from AWS into Elastic — specifically if you are running the Elastic stack on AWS infrastructure.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/Elastic-Observability-VPC-Flow-Logs.jpg" alt="" /></p>
<p>In this blog we will provide an overview of the second option, Elastic’s serverless forwarder collecting VPC Flow Logs from an application deployed on EC2 instances. Here’s what we'll cover:</p>
<ul>
<li>A walk-through on how to analyze VPC Flow Log info with Elastic’s Discover, dashboard, and ML analysis.</li>
<li>A detailed step-by-step overview and setup of the Elastic serverless forwarder on AWS as a pipeline for VPC Flow Logs into <a href="http://cloud.elastic.co">Elastic Cloud</a>.</li>
</ul>
<h2>Elastic’s serverless forwarder on AWS Lambda</h2>
<p>AWS users can quickly ingest logs stored in Amazon S3, CloudWatch, or Kinesis with the Elastic serverless forwarder, an AWS Lambda application, and view them in the Elastic Stack alongside other logs and metrics for centralized analytics. Once the Elastic serverless forwarder is configured and deployed from the AWS Serverless Application Repository (SAR), logs will be ingested and available in Elastic for analysis. See the following links for further configuration guidance:</p>
<ul>
<li><a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">Elastic’s serverless forwarder (runs on Lambda and available in AWS SAR)</a></li>
<li><a href="https://github.com/elastic/elastic-serverless-forwarder/blob/main/docs/README-AWS.md#s3_config_file">Serverless forwarder GitHub repo</a></li>
</ul>
<p>In our configuration we will ingest VPC Flow Logs into Elastic for the three-tier app deployed in the previous <a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">blog</a>.</p>
<p>There are three different configurations with the Elastic serverless forwarder:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-3-configurations.png" alt="" /></p>
<p>Logs can be directly ingested from:</p>
<ul>
<li><strong>Amazon CloudWatch:</strong> Elastic serverless forwarder can pull VPC Flow Logs directly from an Amazon CloudWatch log group, which is a commonly used endpoint to store VPC Flow Logs in AWS.</li>
<li><strong>Amazon Kinesis:</strong> Elastic serverless forwarder can pull VPC Flow Logs directly from Kinesis, which is another location to <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-firehose.html">publish VPC Flow Logs</a>.</li>
<li><strong>Amazon S3:</strong> Elastic serverless forwarder can pull VPC Flow Logs from Amazon S3 via SQS event notifications, which is a common endpoint to publish VPC Flow Logs in AWS.</li>
</ul>
<p>In the second half of this blog, we will review a common configuration: sending VPC Flow Logs to Amazon S3 and from there into Elastic Cloud.</p>
<p>But first let's review how to analyze VPC Flow Logs on Elastic.</p>
<h2>Analyzing VPC Flow Logs in Elastic</h2>
<p>Now that you have VPC Flow Logs in Elastic Cloud, how can you analyze them?</p>
<p>There are several analyses you can perform on the VPC Flow Log data:</p>
<ol>
<li>Use Elastic’s Analytics Discover capabilities to manually analyze the data.</li>
<li>Use Elastic Observability’s anomaly feature to identify anomalies in the logs.</li>
<li>Use an out-of-the-box (OOTB) dashboard to further analyze data.</li>
</ol>
<h3>Using Elastic Discover</h3>
<p>In Elastic analytics, you can search and filter your data, get information about the structure of the fields, and display your findings in a visualization. You can also customize and save your searches and place them on a dashboard. With Discover, you can:</p>
<ul>
<li>View logs in bulk, within specific time frames</li>
<li>Look at individual details of each entry (document)</li>
<li>Filter for specific values</li>
<li>Analyze fields</li>
<li>Create and save searches</li>
<li>Build visualizations</li>
</ul>
<p>For a complete understanding of Discover and all of Elastic’s analytics capabilities, look at <a href="https://www.elastic.co/guide/en/kibana/current/discover.html#">Elastic documentation</a>.</p>
<p>For VPC Flow Logs, it is important to understand:</p>
<ul>
<li>How many logs were accepted/rejected</li>
<li>Where potential security violations are occurring (for example, source IPs from outside the VPC)</li>
<li>Which ports are generally being queried</li>
</ul>
<p>I’ve filtered the logs on the following:</p>
<ul>
<li>Amazon S3: bshettisartest</li>
<li>VPC Flow Log action: REJECT</li>
<li>VPC Network Interface: Webserver 1</li>
</ul>
<p>We want to see what IP addresses are trying to hit our web servers.</p>
<p>From that, we want to understand which IP addresses we are getting the most REJECTs from, so we simply find the <strong>source.ip</strong> field. Then, we can quickly get a breakdown that shows 185.242.53.156 is the most rejected address in the 3+ hours since we turned on VPC Flow Logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-100-hits.png" alt="" /></p>
<p>Additionally, I can see a visualization by selecting the “Visualize” button. We get the following, which we can add to a dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-add-to-a-dashboard.png" alt="" /></p>
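<p>For intuition, the same "top rejected source IPs" breakdown can be reproduced offline with a small Python sketch over raw records in the AWS default flow log format. The field order follows the default format; the sample records are made up and use documentation IP ranges, so they are not the data shown in the screenshots.</p>
<pre><code class="language-python">from collections import Counter

# Field order of the AWS default flow log format.
FIELDS = [
    'version', 'account_id', 'interface_id', 'srcaddr', 'dstaddr',
    'srcport', 'dstport', 'protocol', 'packets', 'bytes',
    'start', 'end', 'action', 'log_status',
]

# Made-up sample records for illustration only.
SAMPLE = [
    '2 123456789012 eni-0a1b2c3d 203.0.113.7 10.0.1.5 40000 23 6 1 40 1674460000 1674460060 REJECT OK',
    '2 123456789012 eni-0a1b2c3d 203.0.113.7 10.0.1.5 40001 445 6 1 40 1674460000 1674460060 REJECT OK',
    '2 123456789012 eni-0a1b2c3d 198.51.100.9 10.0.1.5 40002 443 6 5 400 1674460000 1674460060 ACCEPT OK',
    '2 123456789012 eni-0a1b2c3d 198.51.100.20 10.0.1.5 40003 23 6 1 40 1674460000 1674460060 REJECT OK',
]


def parse(line):
    """Split one space-delimited flow log record into a dict of fields."""
    return dict(zip(FIELDS, line.split()))


records = [parse(line) for line in SAMPLE]
rejects = Counter(r['srcaddr'] for r in records if r['action'] == 'REJECT')
print(rejects.most_common(1))  # prints: [('203.0.113.7', 2)]
</code></pre>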
<p>In addition to IP addresses, we also want to see which ports are being hit on our web servers.<br />
We select the destination port field, and the quick pop-up shows us a list of ports being targeted. We can see that port 23 (generally used for Telnet), port 445 (used for SMB, which Microsoft Active Directory relies on), and port 443 (used for HTTPS) are being targeted. We also see these are all REJECT.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-reject.png" alt="" /></p>
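<p>If you prefer a query over clicking through the field pop-up, a terms aggregation can return the same list of targeted ports. The sketch below builds such a request body as a Python dictionary. The field names <code>aws.vpcflow.action</code> and <code>destination.port</code> are assumptions based on the AWS integration's VPC Flow Logs data stream, so check them against your own index mapping.</p>
<pre><code class="language-python">import json

# Assumed field names; verify them against your own mapping.
request_body = {
    'size': 0,  # return only the aggregation, not individual documents
    'query': {'term': {'aws.vpcflow.action': 'REJECT'}},
    'aggs': {
        'top_rejected_ports': {
            'terms': {'field': 'destination.port', 'size': 10}
        }
    },
}

print(json.dumps(request_body, indent=2))
</code></pre>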
<h3>Anomaly detection in Elastic Observability logs</h3>
<p>In addition to Discover, Elastic Observability provides the ability to detect anomalies in logs. In Elastic Observability -&gt; Logs -&gt; Anomalies you can turn on machine learning for:</p>
<ul>
<li>Log rate: automatically detects anomalous log entry rates</li>
<li>Categorization: automatically categorizes log messages</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomaly-detection-with-machine-learning.png" alt="" /></p>
<p>For our VPC Flow Log, we turned both on. And when we look at what has been detected for anomalous log entry rates, we see:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomalies.png" alt="" /></p>
<p>Elastic immediately detected a spike in logs when we turned on VPC Flow Logs for our application. The rate change is being detected because we had also been ingesting VPC Flow Logs from another application for a couple of days before adding the application in this blog.</p>
<p>We can further drill down into this anomaly with machine learning and analyze further.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-anomaly-explorer.png" alt="" /></p>
<p>There is more machine learning analysis you can utilize with your logs — check out <a href="https://www.elastic.co/guide/en/kibana/8.5/xpack-ml.html">Elastic machine learning documentation</a>.</p>
<p>Since we know that a spike exists, we can also use the Explain Log Rate Spikes capability in Elastic’s AIOps Labs, under Machine Learning. Additionally, we’ve grouped the results to see what is causing some of the spikes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-explain-log-rate-spikes.png" alt="" /></p>
<p>As we can see, a specific network interface is sending more VPC Flow Logs than others. We can drill down into this further in Discover.</p>
<h3>VPC Flow Log dashboard on Elastic Observability</h3>
<p>Finally, Elastic also provides an OOTB dashboard showing the top IP addresses hitting your VPC, where they are coming from geographically, the time series of the flows, and a summary of VPC Flow Log rejects within the time frame.</p>
<p>This is a baseline dashboard that can be enhanced with visualizations you find in Discover, as we reviewed in option 1 (Using Elastic’s Analytics Discover capabilities) above.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-action-geolocation.png" alt="" /></p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of configuring the Elastic serverless forwarder and Elastic Observability to ingest data.</p>
<h3>Prerequisites and config</h3>
<p>If you plan on following these steps, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>) on AWS. Deploying this on AWS is required for Elastic Serverless Forwarder.</li>
<li>Ensure you have an AWS account with permissions to pull the necessary data from AWS. Specifically, ensure you can configure the agent to pull data from AWS as needed. <a href="https://docs.elastic.co/integrations/aws#requirements">Please look at the documentation for details</a>.</li>
<li>We used <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s three-tier app</a> and installed it as instructed in GitHub. (<a href="https://www.elastic.co/blog/aws-service-metrics-monitor-observability-easy">See blog on ingesting metrics from the AWS services supporting this app</a>.)</li>
<li>Configure and install Elastic’s Serverless Forwarder.</li>
<li>Ensure you turn on VPC Flow Logs for the VPC where the application is deployed and send logs to Amazon S3.</li>
</ul>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-start-cloud-trial.png" alt="" /></p>
<h3>Step 1: Deploy Elastic on AWS</h3>
<p>Once logged in to Elastic Cloud, create a deployment on AWS. It’s important to ensure that the deployment is on AWS, because the Elastic serverless forwarder connects specifically to an endpoint that needs to be on AWS.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-create-a-deployment.png" alt="" /></p>
<p>Once your deployment is created, make sure you copy the Elasticsearch endpoint.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-aws-logs.png" alt="" /></p>
<p>The endpoint should be an AWS endpoint, such as:</p>
<pre><code class="language-bash">https://aws-logs.es.us-east-1.aws.found.io
</code></pre>
<h3>Step 2: Turn on Elastic’s AWS Integrations on AWS</h3>
<p>In your deployment’s Elastic Integration section, go to the AWS integration and select Install AWS assets.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-aws-settings.png" alt="" /></p>
<h3>Step 3: Deploy your application</h3>
<p>Follow the instructions listed out in <a href="https://github.com/aws-samples/aws-three-tier-web-architecture-workshop">AWS’s Three-Tier app</a> and instructions in the workshop link on GitHub. The workshop is listed <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/85cd2bb2-7f79-4e96-bdee-8078e469752a/en-US">here</a>.</p>
<p>Once you’ve installed the app, get credentials from AWS. These will be needed for Elastic’s AWS integration.</p>
<p>There are several options for credentials:</p>
<ul>
<li>Use access keys directly</li>
<li>Use temporary security credentials</li>
<li>Use a shared credentials file</li>
<li>Use an IAM role Amazon Resource Name (ARN)</li>
</ul>
<p>View more details on specifics around necessary <a href="https://docs.elastic.co/en/integrations/aws#aws-credentials">credentials</a> and <a href="https://docs.elastic.co/en/integrations/aws#aws-permissions">permissions</a>.</p>
<h3>Step 4: Send VPC Flow Logs to Amazon S3 and set up Amazon SQS</h3>
<p>In the VPC for the application deployed in Step 3, you will need to configure VPC Flow Logs and point them to an Amazon S3 bucket. Specifically, you will want to keep it as AWS default format.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-create-flow-log.png" alt="" /></p>
<p>Create the VPC Flow log.</p>
<p>Next:</p>
<ul>
<li><a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-getting-started.html">Set up an Amazon SQS queue</a></li>
<li><a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html">Configure Amazon S3 event notifications</a></li>
</ul>
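<p>For reference, the S3 event notification that forwards "object created" events to the queue has roughly the following shape. This Python sketch only builds and prints the configuration; it makes no AWS calls, and the queue ARN is a hypothetical placeholder.</p>
<pre><code class="language-python">import json

# Hypothetical ARN; replace it with the queue you created.
QUEUE_ARN = 'arn:aws:sqs:us-east-1:123456789012:vpc-flow-logs-queue'

notification_configuration = {
    'QueueConfigurations': [
        {
            'QueueArn': QUEUE_ARN,
            # Notify the queue for every new object written to the bucket.
            'Events': ['s3:ObjectCreated:*'],
        }
    ]
}

print(json.dumps(notification_configuration, indent=2))
</code></pre>
<p>A dictionary of this shape is what the S3 bucket notification API expects, for example as the <code>NotificationConfiguration</code> argument of boto3's <code>put_bucket_notification_configuration</code>. The SQS queue's access policy must also allow Amazon S3 to send messages to it, as described in the AWS documentation linked above.</p>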
<h3>Step 5: Set up Elastic Serverless Forwarder on AWS</h3>
<p>Follow instructions listed in <a href="https://www.elastic.co/guide/en/observability/8.5/aws-deploy-elastic-serverless-forwarder.html">Elastic’s documentation</a> and refer to the <a href="https://www.elastic.co/blog/elastic-and-aws-serverless-application-repository-speed-time-to-actionable-insights-with-frictionless-log-ingestion-from-amazon-s3">previous blog</a> providing an overview. The important bits during the configuration in Lambda’s application repository are to ensure you:</p>
<ul>
<li>Specify the S3 Bucket in ElasticServerlessForwarderS3Buckets where the VPC Flow Logs are being sent. The value is the ARN of the S3 Bucket you created in Step 4.</li>
<li>Specify the configuration file path in ElasticServerlessForwarderS3ConfigFile. The value is the S3 url in the format &quot;s3://bucket-name/config-file-name&quot; pointing to the configuration file (sarconfig.yaml).</li>
<li>Specify the S3 SQS Notifications queue used as the trigger of the Lambda function in ElasticServerlessForwarderS3SQSEvents. The value is the ARN of the SQS Queue you set up in Step 4.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-application-settings.png" alt="" /></p>
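<p>Because a mistyped ARN or URL is an easy way to end up with a forwarder that never triggers, it can help to sanity-check the three values before deploying. The following Python sketch does a loose format check only. The parameter names come from the list above, while the regular expressions and sample values are simplifying assumptions that do not cover every AWS partition or naming rule.</p>
<pre><code class="language-python">import re

# Loose format checks; these patterns are simplifying assumptions.
S3_ARN = re.compile(r'^arn:aws:s3:::[a-z0-9][a-z0-9.-]+$')
S3_URL = re.compile(r'^s3://[a-z0-9][a-z0-9.-]+/.+$')
SQS_ARN = re.compile(r'^arn:aws:sqs:[a-z0-9-]+:\d{12}:[A-Za-z0-9_.-]+$')

CHECKS = {
    'ElasticServerlessForwarderS3Buckets': S3_ARN,
    'ElasticServerlessForwarderS3ConfigFile': S3_URL,
    'ElasticServerlessForwarderS3SQSEvents': SQS_ARN,
}


def invalid_parameters(params):
    """Return the names of parameters that are missing or badly formatted."""
    return [
        name for name, pattern in CHECKS.items()
        if not pattern.match(params.get(name, ''))
    ]


# Hypothetical sample values.
params = {
    'ElasticServerlessForwarderS3Buckets': 'arn:aws:s3:::my-flow-logs',
    'ElasticServerlessForwarderS3ConfigFile': 's3://my-config-bucket/sarconfig.yaml',
    'ElasticServerlessForwarderS3SQSEvents': 'arn:aws:sqs:us-east-1:123456789012:flow-logs-queue',
}
print(invalid_parameters(params))  # prints: []
</code></pre>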
<p>Once Amazon CloudFormation finishes setting up Elastic serverless forwarder, you should see two Amazon Lambda functions:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-functions.png" alt="" /></p>
<p>To check whether logs are coming in, go to the function with “<strong>ApplicationElasticServer</strong>” in the name, then go to monitor and look at <strong>logs</strong>. You should see the logs being pulled from S3.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-logs-function-overview.png" alt="" /></p>
<h3>Step 6: Check and ensure you have logs in Elastic</h3>
<p>Now that steps 1–5 are complete, you can go to Elastic’s Discover capability and you should see VPC Flow Logs coming in. In the image below, we’ve filtered by Amazon S3 bucket <strong>bshettisartest</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/blog-elastic-vpc-flow-log-dashboard-filter.png" alt="" /></p>
<h2>Conclusion: Elastic Observability easily integrates with VPC Flow Logs for analytics, alerting, and insights</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you manage AWS VPC Flow Logs. Here’s a quick recap of lessons and what you learned:</p>
<ul>
<li>A walk-through of how Elastic Observability provides enhanced analysis for VPC Flow Logs:
<ul>
<li>Using Elastic’s Analytics Discover capabilities to manually analyze the data</li>
<li>Leveraging Elastic Observability’s anomaly features to:
<ul>
<li>Identify anomalies in the VPC flow logs</li>
<li>Detect anomalous log entry rates</li>
<li>Automatically categorize log messages</li>
</ul>
</li>
<li>Using an OOTB dashboard to further analyze data</li>
</ul>
</li>
<li>A more detailed walk-through of how to set up the Elastic Serverless Forwarder</li>
</ul>
<p>Start your own <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=5fbc596b-6d2a-433a-8333-0bd1f28e84da%E2%89%BBchannel=el">7-day free trial</a> by signing up via <a href="https://aws.amazon.com/marketplace/pp/prodview-voru33wi6xs7k?trk=d54b31eb-671c-49ba-88bb-7a1106421dfa%E2%89%BBchannel=el">AWS Marketplace</a> and quickly spin up a deployment in minutes on any of the <a href="https://www.elastic.co/guide/en/cloud/current/ec-reference-regions.html#ec_amazon_web_services_aws_regions">Elastic Cloud regions on AWS</a> around the world. Your AWS Marketplace purchase of Elastic will be included in your monthly consolidated billing statement and will draw against your committed spend with AWS.</p>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-elastic-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/vpc-flow-logs-monitoring-analytics-observability/patterns-midnight-background-no-logo-observability.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>