<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Articles by Tamara Dancheva</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Mon, 08 Jun 2026 15:18:17 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Articles by Tamara Dancheva</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Monitor dbt pipelines with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability</link>
            <guid isPermaLink="false">monitor-dbt-pipelines-with-elastic-observability</guid>
            <pubDate>Fri, 26 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to set up a dbt monitoring system with Elastic that proactively alerts on data processing cost spikes, anomalies in rows per table, and data quality test failures]]></description>
            <content:encoded><![CDATA[<p>In the Data Analytics team within the Observability organization in Elastic, we use <a href="https://www.getdbt.com/product/what-is-dbt">dbt (dbt™, data build tool)</a> to execute our SQL data transformation pipelines. dbt is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code. In particular, we use <a href="https://docs.getdbt.com/docs/core/installation-overview">dbt core</a>, the <a href="https://github.com/dbt-labs/dbt-core">open-source project</a>, where you can develop from the command line and run your dbt project.</p>
<p>Our data transformation pipelines run daily and process the data that feed our internal dashboards, reports, analyses, and Machine Learning (ML) models.</p>
<p>In the past, we have had incidents where pipelines failed, source tables contained wrong data, or a change to our SQL code introduced data quality issues, and we only noticed when a weekly report showed an anomalous number of records. That’s why we built a monitoring system that proactively alerts us about these types of incidents as soon as they happen and provides visualizations and analyses to help us understand their root cause, saving us several hours or days of manual investigation.</p>
<p>We have leveraged our own Observability Solution to help solve this challenge, monitoring the entire lifecycle of our dbt implementation. This setup enables us to track the behavior of our models and conduct data quality testing on the final tables. We export dbt process logs from run jobs and tests into Elasticsearch and utilize Kibana to create dashboards, set up alerts, and configure Machine Learning jobs to monitor and assess issues.</p>
<p>The following diagram shows our complete architecture. In a follow-up article, we’ll also cover how we observe our Python data processing and ML model processes using OTEL and Elastic, so stay tuned.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/architecture.png" alt="1 - architecture" /></p>
<h2>Why monitor dbt pipelines with Elastic?</h2>
<p>With every invocation, dbt generates and saves one or more JSON files called <a href="https://docs.getdbt.com/reference/artifacts/dbt-artifacts">artifacts</a> containing log data on the invocation results. <code>dbt run</code> and <code>dbt test</code> invocation logs are <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">stored in the file <code>run_results.json</code></a>, as per the dbt documentation:</p>
<blockquote>
<p>This file contains information about a completed invocation of dbt, including timing and status info for each node (model, test, etc) that was executed. In aggregate, many <code>run_results.json</code> can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.</p>
</blockquote>
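<p>As a rough illustration of the aggregation the dbt documentation describes, the sketch below combines several parsed <code>run_results.json</code> payloads to compute the average runtime per node and the test failure rate. The helper name <code>summarize_run_results</code> and the sample payloads are hypothetical and hand-written for this example; only the <code>results</code>, <code>unique_id</code>, <code>status</code>, and <code>execution_time</code> keys come from the artifact itself.</p>

```python
from collections import defaultdict
from statistics import mean


def summarize_run_results(artifacts):
    """Aggregate several parsed run_results.json payloads.

    Returns a tuple: (average execution_time per unique_id, test failure rate).
    """
    timings = defaultdict(list)
    tests_total = 0
    tests_failed = 0
    for artifact in artifacts:
        for result in artifact["results"]:
            timings[result["unique_id"]].append(result["execution_time"])
            # dbt prefixes the unique_id of tests with "test."
            if result["unique_id"].startswith("test."):
                tests_total += 1
                if result["status"] == "fail":
                    tests_failed += 1
    avg_runtime = {uid: mean(times) for uid, times in timings.items()}
    failure_rate = tests_failed / tests_total if tests_total else 0.0
    return avg_runtime, failure_rate


# Hand-written payloads standing in for two daily invocations (made-up values).
day_1 = {"results": [
    {"unique_id": "model.demo.users", "status": "success", "execution_time": 2.0},
    {"unique_id": "test.demo.unique_users_id", "status": "pass", "execution_time": 0.5},
]}
day_2 = {"results": [
    {"unique_id": "model.demo.users", "status": "success", "execution_time": 4.0},
    {"unique_id": "test.demo.unique_users_id", "status": "fail", "execution_time": 0.5},
]}

avg_runtime, failure_rate = summarize_run_results([day_1, day_2])
print(avg_runtime["model.demo.users"], failure_rate)
```

<p>In our setup this kind of aggregation is done in Kibana once the logs are indexed; the snippet only shows what the artifact makes possible.</p>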
<p>Monitoring <code>dbt run</code> invocation logs can help solve several issues, including tracking and alerting about table volumes, detecting excessive slot time from resource-intensive models, identifying cost spikes due to slot time or volume, and pinpointing slow execution times that may indicate scheduling issues. This system was crucial when we merged a PR with a change in our code that had an issue, producing a sudden drop in the number of daily rows in upstream Table A. By ingesting the <code>dbt run</code> logs into Elastic, our anomaly detection job quickly identified anomalies in the daily row counts for Table A and its downstream tables, B, C, and D. The Data Analytics team received an alert notification about the issue, allowing us to promptly troubleshoot, fix and backfill the tables before it affected the weekly dashboards and downstream ML models.</p>
<p>Monitoring <code>dbt test</code> invocation logs can also address several issues, such as identifying duplicates in tables, detecting unnoticed alterations in allowed values for specific fields through validation of all enum fields, and resolving various other data processing and quality concerns. With dashboards and alerts on data quality tests, we proactively identify issues like duplicate keys, unexpected category values, and increased nulls, ensuring data integrity. In our team, we had an issue where a change in one of our raw lookup tables produced duplicated rows in our user table, doubling the number of users reported. By ingesting the <code>dbt test</code> logs into Elastic, our rules detected that some duplicate tests had failed. The team received an alert notification about the issue, allowing us to troubleshoot it right away by finding the upstream table that was the root cause. These duplicates meant that downstream tables had to process 2x the amount of data, creating a spike in the bytes processed and slot time. The anomaly detection and alerts on the <code>dbt run</code> logs also helped us spot these spikes for individual tables and allowed us to quantify the impact on our billing.</p>
<p>Processing our dbt logs with Elastic and Kibana allows us to obtain real-time insights, helps us quickly troubleshoot potential issues, and keeps our data transformation processes running smoothly. We set up anomaly detection jobs and alerts in Kibana to monitor the number of rows processed by dbt, the slot time, and the results of the tests. This lets us catch real-time incidents, and by promptly identifying and fixing these issues, Elastic makes our data pipeline more resilient and our models more cost-effective, helping us stay on top of cost spikes or data quality issues.</p>
<p>We can also correlate this information with other events ingested into Elastic. For example, using the <a href="https://www.elastic.co/guide/en/enterprise-search/current/connectors-github.html">Elastic GitHub connector</a>, we can correlate data quality test failures or other anomalies with code changes to find the commit or PR that caused the issues. By ingesting application logs into Elastic, we can also use APM to analyze whether these pipeline issues have affected downstream applications, for example through changes in latency, throughput, or error rates. By ingesting billing, revenue, or web traffic data, we could also see the impact on business metrics.</p>
<h2>How to export dbt invocation logs to Elasticsearch</h2>
<p>We use the <a href="https://elasticsearch-py.readthedocs.io/en">Python Elasticsearch client</a> to send the dbt invocation logs to Elastic after we run our <code>dbt run</code> and <code>dbt test</code> processes daily in production. The setup only requires you to install the <a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#installation">Elasticsearch Python client</a> and obtain your Elastic Cloud ID (go to <a href="https://cloud.elastic.co/deployments/">https://cloud.elastic.co/deployments/</a>, select your deployment and find the <code>Cloud ID</code>) and an Elastic Cloud API key (<a href="https://elasticsearch-py.readthedocs.io/en/v8.14.0/quickstart.html#connecting">following this guide</a>).</p>
<p>This Python helper function indexes the results from your <code>run_results.json</code> file into the specified index. You just need to export the following variables to the environment:</p>
<ul>
<li><code>RESULTS_FILE</code>: path to your <code>run_results.json</code> file</li>
<li><code>DBT_RUN_LOGS_INDEX</code>: the name you want to give to the dbt run logs index in Elastic, e.g. <code>dbt_run_logs</code></li>
<li><code>DBT_TEST_LOGS_INDEX</code>: the name you want to give to the dbt test logs index in Elastic, e.g. <code>dbt_test_logs</code></li>
<li><code>ES_CLUSTER_CLOUD_ID</code></li>
<li><code>ES_CLUSTER_API_KEY</code></li>
</ul>
<p>Then call the function <code>log_dbt_es</code> from your Python code, or save this code as a Python script and run it after executing your <code>dbt run</code> or <code>dbt test</code> commands:</p>
<pre><code>from elasticsearch import Elasticsearch, helpers
import os
import sys
import json

def log_dbt_es():
   # Read the configuration from the environment
   RESULTS_FILE = os.environ[&quot;RESULTS_FILE&quot;]
   DBT_RUN_LOGS_INDEX = os.environ[&quot;DBT_RUN_LOGS_INDEX&quot;]
   DBT_TEST_LOGS_INDEX = os.environ[&quot;DBT_TEST_LOGS_INDEX&quot;]
   es_cluster_cloud_id = os.environ[&quot;ES_CLUSTER_CLOUD_ID&quot;]
   es_cluster_api_key = os.environ[&quot;ES_CLUSTER_API_KEY&quot;]

   # Connect to the Elastic Cloud deployment
   es_client = Elasticsearch(
       cloud_id=es_cluster_cloud_id,
       api_key=es_cluster_api_key,
       request_timeout=120,
   )

   # Stop early if dbt did not produce a results file
   if not os.path.exists(RESULTS_FILE):
       print(f&quot;ERROR: No dbt run results found at {RESULTS_FILE}.&quot;)
       sys.exit(1)

   with open(RESULTS_FILE, &quot;r&quot;) as json_file:
       results = json.load(json_file)
       # Invocation-level fields that are copied onto every document
       timestamp = results[&quot;metadata&quot;][&quot;generated_at&quot;]
       metadata = results[&quot;metadata&quot;]
       elapsed_time = results[&quot;elapsed_time&quot;]
       args = results[&quot;args&quot;]
       docs = []
       for result in results[&quot;results&quot;]:
           # Tests have a unique_id starting with &quot;test&quot;; everything else goes to the run index
           if result[&quot;unique_id&quot;].split(&quot;.&quot;)[0] == &quot;test&quot;:
               result[&quot;_index&quot;] = DBT_TEST_LOGS_INDEX
           else:
               result[&quot;_index&quot;] = DBT_RUN_LOGS_INDEX
           result[&quot;@timestamp&quot;] = timestamp
           result[&quot;metadata&quot;] = metadata
           result[&quot;elapsed_time&quot;] = elapsed_time
           result[&quot;args&quot;] = args
           docs.append(result)
       # Index all documents with a single bulk request
       _ = helpers.bulk(es_client, docs)
   return &quot;Done&quot;

# Call the function
log_dbt_es()
</code></pre>
<p>If you want to add/remove any other fields from <code>run_results.json</code>, you can modify the above function to do it.</p>
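<p>One possible way to do that, shown here only as a sketch, is to pass each result through a small filter before it is appended to <code>docs</code>. The helper <code>slim_result</code> and the <code>KEEP_FIELDS</code> allow-list are hypothetical names introduced for this example, and the sample entry is made up.</p>

```python
# Hypothetical allow-list; adjust it to the fields you want to keep.
KEEP_FIELDS = ("unique_id", "status", "execution_time", "adapter_response")


def slim_result(result, keep=KEEP_FIELDS):
    """Return a copy of one run_results.json entry with only the chosen fields."""
    return {key: result[key] for key in keep if key in result}


# Made-up entry with a couple of fields we decide not to index.
sample = {
    "unique_id": "model.demo.users",
    "status": "success",
    "execution_time": 1.2,
    "timing": [],
    "thread_id": "Thread-1",
}
print(slim_result(sample))
```

<p>Because the filter returns a new dictionary, the <code>_index</code>, <code>@timestamp</code>, and invocation-level fields would need to be set on the filtered copy rather than on the original entry.</p>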
<p>Once the results are indexed, you can use Kibana to create Data Views for both indexes and start exploring them in Discover.</p>
<p>Go to Discover, click on the data view selector on the top left and “Create a data view”.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-create-dataview.png" alt="2 - discover create a data view" /></p>
<p>Now you can create a data view with your preferred name. Do this for both dbt run (<code>DBT_RUN_LOGS_INDEX</code> in your code) and dbt test (<code>DBT_TEST_LOGS_INDEX</code> in your code) indices:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/create-dataview.png" alt="3 - create a data view" /></p>
<p>Going back to Discover, you’ll be able to select the Data Views and explore the data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/discover-logs-explorer.png" alt="4 - discover logs explorer" /></p>
<h2>dbt run alerts, dashboards and ML jobs</h2>
<p>The invocation of <a href="https://docs.getdbt.com/reference/commands/run"><code>dbt run</code></a> executes compiled SQL model files against the current database. <code>dbt run</code> invocation logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique model identifier</li>
<li><code>execution_time</code>: Total time spent executing this model run</li>
</ul>
<p>The logs also contain the following metrics about the job execution from the adapter:</p>
<ul>
<li><code>adapter_response.bytes_processed</code></li>
<li><code>adapter_response.bytes_billed</code></li>
<li><code>adapter_response.slot_ms</code></li>
<li><code>adapter_response.rows_affected</code></li>
</ul>
<p>We have used Kibana to set up <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">Anomaly Detection jobs</a> on the above-mentioned metrics. You can configure a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">multi-metric job</a> split by <code>unique_id</code> to be alerted when the sum of rows affected, slot time consumed, or bytes billed is anomalous per table. You can track one job per metric. If you have built a dashboard of the metrics per table, you can use <a href="https://www.elastic.co/guide/en/machine-learning/8.14/ml-jobs-from-lens.html">this shortcut</a> to create the Anomaly Detection job directly from the visualization. After the jobs are created and are running on incoming data, you can <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-view-results.html">view the jobs</a> and add them to a dashboard using the three dots button in the anomaly timeline:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-add-to-dashboard.png" alt="5 - add ML job to dashboard" /></p>
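<p>If you prefer to define the job through the API instead of the Kibana wizard, the multi-metric job described above could look roughly like the request body built below. This is a sketch under assumptions: the daily <code>bucket_span</code> is our guess based on pipelines that run once a day, the description text is illustrative, and the datafeed that points the job at your dbt run logs index is omitted.</p>

```python
import json

# Metric fields taken from the adapter_response list above.
METRICS = [
    "adapter_response.rows_affected",
    "adapter_response.slot_ms",
    "adapter_response.bytes_billed",
]


def build_job_config(metrics=METRICS, split_field="unique_id", bucket_span="1d"):
    """Build an anomaly detection job body with one sum detector per metric."""
    return {
        "description": "dbt run metrics per table (illustrative)",
        "analysis_config": {
            "bucket_span": bucket_span,  # assumption: one bucket per daily run
            "detectors": [
                {
                    "function": "sum",
                    "field_name": metric,
                    "partition_field_name": split_field,
                }
                for metric in metrics
            ],
            "influencers": [split_field],
        },
        "data_description": {"time_field": "@timestamp"},
    }


print(json.dumps(build_job_config(), indent=2))
```

<p>Passing a single-element list as <code>metrics</code> gives the one-job-per-metric variant.</p>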
<p>We have used the <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-configuring-alerts.html">ML job to set up alerts</a> that send us emails or Slack messages when anomalies are detected. Alerts can be created directly from the Jobs (Machine Learning &gt; Anomaly Detection Jobs) page, by clicking on the three dots at the end of the ML job row:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-create-alert.png" alt="6 - create alert from ML job" /></p>
<p>We also use <a href="https://www.elastic.co/guide/en/kibana/current/dashboard.html">Kibana dashboards</a> to visualize the anomaly detection job results and related metrics per table, to identify which tables consume most of our resources, to have visibility on their temporal evolution, and to measure aggregated metrics that can help us understand month over month changes.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ml-job-dashboard.png" alt="7 - ML job in dashboard" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-slot-time.png" alt="8 - dashboard slot time chart" />
<img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-aggregated-metrics.png" alt="9 - dashboard aggregated metrics" /></p>
<h2>dbt test alerts and dashboards</h2>
<p>You may already be familiar with <a href="https://docs.getdbt.com/docs/build/data-tests">tests in dbt</a>, but if you’re not, dbt data tests are assertions you make about your models. Using the command <a href="https://docs.getdbt.com/reference/commands/test"><code>dbt test</code></a>, dbt will tell you if each test in your project passes or fails. <a href="https://docs.getdbt.com/docs/build/data-tests#example">Here is an example of how to set them up</a>. In our team, we use out-of-the-box dbt tests (<code>unique</code>, <code>not_null</code>, <code>accepted_values</code>, and <code>relationships</code>) and the packages <a href="https://hub.getdbt.com/dbt-labs/dbt_utils/latest/">dbt_utils</a> and <a href="https://hub.getdbt.com/calogica/dbt_expectations/latest/">dbt_expectations</a> for some extra tests. When the command <code>dbt test</code> is run, it generates logs that are stored in <code>run_results.json</code>.</p>
<p>dbt test logs contain the <a href="https://docs.getdbt.com/reference/artifacts/run-results-json">following fields</a>:</p>
<ul>
<li><code>unique_id</code>: Unique test identifier, tests contain the “test” prefix in their unique identifier</li>
<li><code>status</code>: result of the test, <code>pass</code> or <code>fail</code></li>
<li><code>execution_time</code>: Total time spent executing this test</li>
<li><code>failures</code>: 0 if the test passes; a non-zero value if it fails (by default, the number of records that failed the test)</li>
<li><code>message</code>: If the test fails, reason why it failed</li>
</ul>
<p>The logs also contain the metrics about the job execution from the adapter.</p>
<p>We have set up alerts on document count (see <a href="https://www.elastic.co/guide/en/observability/8.14/custom-threshold-alert.html">guide</a>) that send us an email or Slack message when there are any failed tests. The rule is set up on the dbt test Data View that we created earlier, with a query that filters on <code>status:fail</code> to obtain the logs for the failed tests and a rule condition of document count greater than 0.
Whenever a test fails in production, we get an alert with links to the alert details and dashboards so that we can troubleshoot the failure:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/email-alert.png" alt="10 - alert" /></p>
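<p>To make the rule logic concrete, here is a minimal sketch of an equivalent query body and of the rule condition applied to a few documents locally. The function names, the one-day lookback, and the sample documents are assumptions made for this example; the actual rule is configured in Kibana as described above.</p>

```python
def failed_tests_query(lookback="now-1d"):
    """Query body matching failed dbt tests within the lookback window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match": {"status": "fail"}},
                    {"range": {"@timestamp": {"gte": lookback}}},
                ]
            }
        }
    }


def should_alert(docs):
    """Mirror the rule condition locally: alert if any document has status 'fail'."""
    failed = [doc for doc in docs if doc.get("status") == "fail"]
    return len(failed) != 0, failed


# Made-up test log documents.
sample_docs = [
    {"unique_id": "test.demo.unique_users_id", "status": "fail"},
    {"unique_id": "test.demo.not_null_users_id", "status": "pass"},
]
alert, failed = should_alert(sample_docs)
print(alert, [doc["unique_id"] for doc in failed])
```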
<p>We have also built a dashboard to visualize the tests run, tests failed, and their execution time and slot time to have a historical view of the test run:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/dashboard-tests.png" alt="11 - dashboard dbt tests" /></p>
<h2>Finding Root Causes with the AI Assistant</h2>
<p>The most effective way for us to analyze these multiple sources of information is using the AI Assistant to help us troubleshoot the incidents. In our case, we got an alert about a test failure, and we used the AI Assistant to give us context on what happened. Then we asked if there were any downstream consequences, and the AI Assistant interpreted the results of the Anomaly Detection job, which indicated a spike in slot time for one of our downstream tables and the size of the increase compared to the baseline. Finally, we asked for the root cause, and the AI Assistant was able to find and provide us with a link to a PR from our GitHub changelog that matched the start of the incident and was the most probable cause.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/ai-assistant.png" alt="12 - ai assistant troubleshoot" /></p>
<h2>Conclusion</h2>
<p>As a Data Analytics team, we are responsible for guaranteeing that the tables, charts, models, reports, and dashboards we provide to stakeholders are accurate and contain the right sources of information. As teams grow, the number of models we own becomes larger and more interconnected, and it isn’t easy to guarantee that everything is running smoothly and providing accurate results. Having a monitoring system that proactively alerts us on cost spikes, anomalies in row counts, or data quality test failures is like having a trusted companion that will alert you in advance if something goes wrong and help you get to the root cause of the issue.</p>
<p>dbt invocation logs are a crucial source of information about the status of our data pipelines, and Elastic is the perfect tool to extract the maximum potential out of them. Use this blog post as a starting point for utilizing your dbt logs to help your team achieve greater reliability and peace of mind, allowing them to focus on more strategic tasks rather than worrying about potential data issues.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-dbt-pipelines-with-elastic-observability/monitoring-dbt-with-elastic.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Monitor your Python data pipelines with OTEL]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-your-python-data-pipelines-with-otel</link>
            <guid isPermaLink="false">monitor-your-python-data-pipelines-with-otel</guid>
            <pubDate>Thu, 08 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to configure OTEL for your data pipelines, detect any anomalies, analyze performance, and set up corresponding alerts with Elastic.]]></description>
            <content:encoded><![CDATA[<p>This article delves into how to implement observability practices, particularly using <a href="https://opentelemetry.io/">OpenTelemetry (OTEL)</a> in Python, to enhance the monitoring and quality control of data pipelines using Elastic. The examples focus on ETL (Extract, Transform, Load) processes, where accurate and reliable data pipelines are crucial for Business Intelligence (BI), but the strategies and tools discussed are equally applicable to Python processes used for Machine Learning (ML) models or other data processing tasks.</p>
<h2>Introduction</h2>
<p>Data pipelines, particularly ETL processes, form the backbone of modern data architectures. These pipelines are responsible for extracting raw data from various sources, transforming it into meaningful information, and loading it into data warehouses or data lakes for analysis and reporting.</p>
<p>In our organization, we have Python-based ETL scripts that play a pivotal role in exporting and processing data from Elasticsearch (ES) clusters and loading it into <a href="https://cloud.google.com/bigquery">Google BigQuery (BQ)</a>. This processed data then feeds into <a href="https://www.getdbt.com">dbt (data build tool)</a> models, which further refine the data and make it available for analytics and reporting. To see the full architecture and learn how we monitor our dbt pipelines with Elastic, see <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor dbt pipelines with Elastic Observability</a>. In this article we focus on the ETL scripts. Given the critical nature of these scripts, it is imperative to set up mechanisms to control and ensure the quality of the data they generate.</p>
<p>The strategies discussed here can be extended to any script or application that handles data processing or machine learning models, regardless of the programming language used, as long as a corresponding agent exists that supports OTEL instrumentation.</p>
<h2>Motivation</h2>
<p>Observability in data pipelines involves monitoring the entire lifecycle of data processing to ensure that everything works as expected. It includes:</p>
<ol>
<li>Data Quality Control:</li>
</ol>
<ul>
<li>Detecting anomalies in the data, such as unexpected drops in record counts.</li>
<li>Verifying that data transformations are applied correctly and consistently.</li>
<li>Ensuring the integrity and accuracy of the data loaded into the data warehouse.</li>
</ul>
<ol start="2">
<li>Performance Monitoring:</li>
</ol>
<ul>
<li>Tracking the execution time of ETL scripts to identify bottlenecks and optimize performance.</li>
<li>Monitoring resource usage, such as memory and CPU consumption, to ensure efficient use of infrastructure.</li>
</ul>
<ol start="3">
<li>Real-time Alerting:</li>
</ol>
<ul>
<li>Setting up alerts for immediate notification of issues such as failed ETL jobs, data quality issues, or performance degradation.</li>
<li>Identifying the root cause of such incidents.</li>
<li>Proactively addressing incidents to minimize downtime and impact on business operations.</li>
</ul>
<p>Issues such as failed ETL jobs can even point to larger infrastructure problems or data quality issues in the data sources.</p>
<h2>Steps for Instrumentation</h2>
<p>Here are the steps to automatically instrument your Python script for exporting OTEL traces, metrics, and logs.</p>
<h3>Step 1: Install Required Libraries</h3>
<p>We first need to install the following libraries.</p>
<pre><code class="language-sh">pip install elastic-opentelemetry google-cloud-bigquery[opentelemetry]
</code></pre>
<p>You can also add them to your project's <code>requirements.txt</code> file and install them with <code>pip install -r requirements.txt</code>.</p>
<h4>Explanation of Dependencies</h4>
<ol>
<li>
<p><strong>elastic-opentelemetry</strong>: This package is the Elastic Distribution for OpenTelemetry Python. Under the hood it will install the following packages:</p>
<ul>
<li>
<p><strong>opentelemetry-distro</strong>: This package is a convenience distribution of OpenTelemetry, which includes the OpenTelemetry SDK, APIs, and various instrumentation packages. It simplifies the setup and configuration of OpenTelemetry in your application.</p>
</li>
<li>
<p><strong>opentelemetry-exporter-otlp</strong>: This package provides an exporter that sends telemetry data to the OpenTelemetry Collector or any other endpoint that supports the OpenTelemetry Protocol (OTLP). This includes traces, metrics, and logs.</p>
</li>
<li>
<p><strong>opentelemetry-instrumentation-system-metrics</strong>: This package provides instrumentation for collecting system metrics, such as CPU usage, memory usage, and other system-level metrics.</p>
</li>
</ul>
</li>
<li>
<p><strong>google-cloud-bigquery[opentelemetry]</strong>: This package integrates Google Cloud BigQuery with OpenTelemetry, allowing you to trace and monitor BigQuery operations.</p>
</li>
</ol>
<h3>Step 2: Export OTEL Variables</h3>
<p>Set the necessary OpenTelemetry (OTEL) variables using the OpenTelemetry configuration that Elastic APM provides.</p>
<p>Go to APM -&gt; Services -&gt; Add data (top left corner).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-1.png" alt="1 - Get OTEL variables step 1" /></p>
<p>In this section you will find the steps for configuring the various APM agents. Navigate to OpenTelemetry to find the variables that you need to export.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-variables-2.png" alt="2 - Get OTEL variables step 2" /></p>
<p><strong>Find OTLP Endpoint</strong>:</p>
<ul>
<li>Look for the section related to OpenTelemetry or OTLP configuration.</li>
<li>The <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> is typically provided as part of the setup instructions for integrating OpenTelemetry with Elastic APM. It might look something like <code>https://&lt;your-apm-server&gt;/otlp</code>.</li>
</ul>
<p><strong>Obtain OTLP Headers</strong>:</p>
<ul>
<li>In the same section, you should find instructions or a field for OTLP headers. These headers are often used for authentication purposes.</li>
<li>Copy the necessary headers provided by the interface. They might look like <code>Authorization: Bearer &lt;your-token&gt;</code>.</li>
</ul>
<p>Note: When using Python, you need to replace the whitespace between <code>Bearer</code> and your token with <code>%20</code> in the <code>OTEL_EXPORTER_OTLP_HEADERS</code> variable.</p>
<p>Alternatively you can use a different approach for authentication using API keys (see <a href="https://github.com/elastic/elastic-otel-python?tab=readme-ov-file#authentication">instructions</a>). If you are using our <a href="https://www.elastic.co/docs/current/serverless/general/what-is-serverless-elastic">serverless offering</a> you will need to use this approach instead.</p>
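<p>For the <code>Bearer</code> header format described above, the encoding of the whitespace can be produced with the standard library rather than by hand. The helper name <code>otlp_auth_header</code> is hypothetical and <code>your-token</code> is a placeholder; this is a small sketch, not part of our pipeline code.</p>

```python
from urllib.parse import quote


def otlp_auth_header(token, scheme="Bearer"):
    """Build the OTEL_EXPORTER_OTLP_HEADERS value with the space percent-encoded."""
    # quote() turns the space between the scheme and the token into %20
    return "Authorization=" + quote(f"{scheme} {token}", safe="")


print(otlp_auth_header("your-token"))
```

<p>The printed value is what you would assign to <code>OTEL_EXPORTER_OTLP_HEADERS</code>.</p>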
<p><strong>Set up the variables</strong>:</p>
<ul>
<li>Replace the placeholders in your script with the actual values obtained from the Elastic APM interface and execute it in your shell via the source command <code>source env.sh</code>.</li>
</ul>
<p>Below is a script to set these variables:</p>
<pre><code class="language-sh">#!/bin/bash
echo &quot;--- :otel: Setting OTEL variables&quot;
export OTEL_EXPORTER_OTLP_ENDPOINT='https://your-apm-server:443/otlp'
export OTEL_EXPORTER_OTLP_HEADERS='Authorization=Bearer%20your-token'
export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
export OTEL_PYTHON_LOG_CORRELATION=true
export ELASTIC_OTEL_SYSTEM_METRICS_ENABLED=true
export OTEL_METRIC_EXPORT_INTERVAL=5000
export OTEL_LOGS_EXPORTER=&quot;otlp,console&quot;
</code></pre>
<p>With these variables set, we are ready for auto-instrumentation without needing to add anything to the code.</p>
<h4>Explanation of Variables</h4>
<ul>
<li>
<p><strong>OTEL_EXPORTER_OTLP_ENDPOINT</strong>: This variable specifies the endpoint to which OTLP data (traces, metrics, logs) will be sent. Replace the <code>your-apm-server</code> placeholder value with your actual OTLP endpoint.</p>
</li>
<li>
<p><strong>OTEL_EXPORTER_OTLP_HEADERS</strong>: This variable specifies any headers required for authentication or other purposes when sending OTLP data. Replace the <code>your-token</code> placeholder with your actual token.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED</strong>: This variable enables auto-instrumentation for logging in Python, allowing logs to be automatically enriched with trace context.</p>
</li>
<li>
<p><strong>OTEL_PYTHON_LOG_CORRELATION</strong>: This variable enables log correlation, which includes trace context in log entries to correlate logs with traces.</p>
</li>
<li>
<p><strong>OTEL_METRIC_EXPORT_INTERVAL</strong>: This variable specifies the metric export interval in milliseconds, in this case 5s.</p>
</li>
<li>
<p><strong>OTEL_LOGS_EXPORTER</strong>: This variable specifies the exporter to use for logs. Setting it to &quot;otlp&quot; means that logs will be exported using the OTLP protocol. Adding &quot;console&quot; specifies that logs should be exported to both the OTLP endpoint and the console. In our case, we chose to export to the console as well for better visibility on the infra side.</p>
</li>
<li>
<p><strong>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</strong>: This variable must be set to true to collect system metrics when using the Elastic distribution, because it is set to false by default.</p>
</li>
</ul>
<p>Note: <strong>OTEL_METRICS_EXPORTER</strong> and <strong>OTEL_TRACES_EXPORTER</strong>: These variables specify the exporter to use for metrics and traces. They are set to &quot;otlp&quot; by default, which means that metrics and traces will be exported using the OTLP protocol.</p>
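<p>Since a missing endpoint or header is easy to overlook, a small preflight check can be run before starting an ETL. The sketch below is an optional addition, not something the instrumentation requires; the function name <code>missing_otel_vars</code> and the choice of which variables count as required are assumptions for this example.</p>

```python
import os

# Assumption: these two variables are the minimum needed to export to Elastic.
REQUIRED = ("OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_HEADERS")


def missing_otel_vars(env=None):
    """Return the required OTEL variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Example environment with the headers variable left out on purpose.
example_env = {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://your-apm-server/otlp"}
print(missing_otel_vars(example_env))
```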
<h3>Running Python ETLs</h3>
<p>We run Python ETLs with the following command:</p>
<pre><code class="language-sh">OTEL_RESOURCE_ATTRIBUTES=&quot;service.name=x-ETL,service.version=1.0,deployment.environment=production&quot; opentelemetry-instrument python3 X_ETL.py
</code></pre>
<h4>Explanation of the Command</h4>
<ul>
<li>
<p><strong>OTEL_RESOURCE_ATTRIBUTES</strong>: This variable specifies additional resource attributes, such as <a href="https://www.elastic.co/guide/en/observability/current/apm.html">service name</a>, service version and deployment environment, that will be included in all telemetry data. You can customize these values to suit your needs, and you can use a different service name for each script.</p>
</li>
<li>
<p><strong>opentelemetry-instrument</strong>: This command auto-instruments the specified Python script for OpenTelemetry. It sets up the necessary hooks to collect traces, metrics, and logs.</p>
</li>
<li>
<p><strong>python3 X_ETL.py</strong>: This runs the specified Python script (<code>X_ETL.py</code>).</p>
</li>
</ul>
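<p>The value of <code>OTEL_RESOURCE_ATTRIBUTES</code> is a comma-separated list of <code>key=value</code> pairs. To illustrate the format, here is a minimal sketch of how such a string breaks down into individual attributes. The function <code>parse_resource_attributes</code> is a hypothetical helper for illustration only; the OpenTelemetry SDK does its own parsing.</p>
<pre><code class="language-python">def parse_resource_attributes(raw):
    """Split a 'key=value,key=value' string into a dict of attributes."""
    attributes = {}
    for pair in raw.split(','):
        if not pair.strip():
            continue  # Skip empty entries, e.g. from a trailing comma
        key, _, value = pair.partition('=')
        attributes[key.strip()] = value.strip()
    return attributes


attrs = parse_resource_attributes(
    'service.name=x-ETL,service.version=1.0,deployment.environment=production'
)
print(attrs['service.name'])  # x-ETL
</code></pre>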
<h3>Tracing</h3>
<p>We export the traces via the default OTLP protocol.</p>
<p>Tracing is a key aspect of monitoring and understanding the performance of applications. <a href="https://www.elastic.co/guide/en/observability/current/apm-data-model-spans.html">Spans</a> form the building blocks of tracing. They encapsulate detailed information about the execution of specific code paths. They record the start and end times of activities and can have hierarchical relationships with other spans, forming a parent/child structure.</p>
<p>Spans include essential attributes such as transaction IDs, parent IDs, start times, durations, names, types, subtypes, and actions. Additionally, spans may contain stack traces, which provide a detailed view of function calls, including attributes like function name, file path, and line number, which is especially useful for debugging. These attributes help us analyze the script's execution flow, identify performance issues, and enhance optimization efforts.</p>
<p>With the default instrumentation, the whole Python script would be a single span. In our case, we decided to manually add a span for each phase of the Python process, so that we can measure the latency, throughput, error rate, etc. of each phase individually. This is how we define spans manually:</p>
<pre><code class="language-python">from opentelemetry import trace

if __name__ == &quot;__main__&quot;:

    tracer = trace.get_tracer(&quot;main&quot;)
    with tracer.start_as_current_span(&quot;initialization&quot;) as span:
        # Init code
        ...
    with tracer.start_as_current_span(&quot;search&quot;) as span:
        # Step 1 - Search code
        ...
    with tracer.start_as_current_span(&quot;transform&quot;) as span:
        # Step 2 - Transform code
        ...
    with tracer.start_as_current_span(&quot;load&quot;) as span:
        # Step 3 - Load code
        ...
</code></pre>
<p>You can explore traces in the APM interface as shown below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/Traces-APM-Observability-Elastic.png" alt="3 - APM Traces view" /></p>
<h3>Metrics</h3>
<p>We also export metrics, such as CPU usage and memory, via the default OTLP protocol. No extra code needs to be added in the script itself.</p>
<p>Note: Remember to set <code>ELASTIC_OTEL_SYSTEM_METRICS_ENABLED</code> to true.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-metrics-apm-view.png" alt="4 - APM Metrics view" /></p>
<h3>Logging</h3>
<p>We export logs via the default OTLP protocol as well.</p>
<p>For logging, we modify the logging calls to add extra fields using a dictionary structure (bq_fields) as shown below:</p>
<pre><code class="language-python">        job.result()  # Waits for table load to complete
        job_details = client.get_job(job.job_id)  # Get job details

        # Extract job information
        bq_fields = {
            # &quot;slot_time_ms&quot;: job_details.slot_ms,
            &quot;job_id&quot;: job_details.job_id,
            &quot;job_type&quot;: job_details.job_type,
            &quot;state&quot;: job_details.state,
            &quot;path&quot;: job_details.path,
            &quot;job_created&quot;: job_details.created.isoformat(),
            &quot;job_ended&quot;: job_details.ended.isoformat(),
            &quot;execution_time_ms&quot;: (
                job_details.ended - job_details.created
            ).total_seconds()
            * 1000,
            &quot;bytes_processed&quot;: job_details.output_bytes,
            &quot;rows_affected&quot;: job_details.output_rows,
            &quot;destination_table&quot;: job_details.destination.table_id,
            &quot;event&quot;: &quot;BigQuery Load Job&quot;, # Custom event type
            &quot;status&quot;: &quot;success&quot;, # Status of the step (success/error)
            &quot;category&quot;: category # ETL category tag 
        }

        logging.info(&quot;BigQuery load operation successful&quot;, extra=bq_fields)
</code></pre>
<p>This code shows how to extract BQ job stats, including execution time, bytes processed, rows affected, and destination table. You can also add other metadata, as we do with the custom event type, status, and category.</p>
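<p>The <code>execution_time_ms</code> field is the difference between the job's end and creation timestamps, converted to milliseconds. Here is a minimal standalone sketch of that arithmetic, using made-up timestamps in place of a real BigQuery job:</p>
<pre><code class="language-python">from datetime import datetime, timedelta, timezone

# Made-up timestamps standing in for job_details.created / job_details.ended
created = datetime(2024, 7, 1, 12, 0, 0, tzinfo=timezone.utc)
ended = created + timedelta(seconds=12, milliseconds=500)

# Subtracting datetimes yields a timedelta; total_seconds() returns a float
execution_time_ms = (ended - created).total_seconds() * 1000
print(execution_time_ms)  # 12500.0
</code></pre>
<p>Because the measurement starts at the creation time, the value includes any time the job spent waiting before it began running.</p>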
<p>Any logging call at or above the configured level (in this case INFO, set via <code>logging.getLogger().setLevel(logging.INFO)</code>) creates a log that is exported to Elastic. This means that Python scripts that already use <code>logging</code> need no changes to export logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/otel-logs-apm-view.png" alt="5 - APM Logs view" /></p>
<p>For each of the log messages, you can go into the details view (click on the <code>…</code> when you hover over the log line and go into <code>View details</code>) to examine the metadata attached to the log message. You can also explore the logs in <a href="https://www.elastic.co/guide/en/kibana/8.14/discover.html">Discover</a>.</p>
<h4>Explanation of Logging Modification</h4>
<ul>
<li>
<p><strong>logging.info</strong>: This logs an informational message. The message &quot;BigQuery load operation successful&quot; is logged.</p>
</li>
<li>
<p><strong>extra=bq_fields</strong>: This adds additional context to the log entry using the <code>bq_fields</code> dictionary. This context makes the log entries more informative and easier to analyze. The data will later be used to set up alerts and data anomaly detection jobs.</p>
</li>
</ul>
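<p>To see what <code>extra</code> does on the Python side, the following standalone sketch captures log records in memory and shows that each key of the dictionary becomes an attribute on the <code>LogRecord</code>. The <code>ListHandler</code> class, the logger name, and the sample values are made up for illustration.</p>
<pre><code class="language-python">import logging


class ListHandler(logging.Handler):
    """Collect log records in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger('etl_demo')
logger.setLevel(logging.INFO)
logger.propagate = False  # Keep the demo records out of the root logger
handler = ListHandler()
logger.addHandler(handler)

logger.debug('Below the INFO threshold, so this record is dropped')
logger.info(
    'BigQuery load operation successful',
    extra={'event': 'BigQuery Load Job', 'status': 'success'},
)

record = handler.records[0]
print(len(handler.records), record.getMessage(), record.status, record.event)
</code></pre>
<p>Keys in <code>extra</code> must not collide with built-in <code>LogRecord</code> attributes such as <code>name</code> or <code>message</code>; <code>logging</code> raises a <code>KeyError</code> in that case.</p>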
<h2>Monitoring in Elastic's APM</h2>
<p>As shown, we can examine traces, metrics, and logs in the APM interface. To make the most of this data, we also use nearly the whole suite of features in Elastic Observability, alongside Elastic Analytics' ML capabilities.</p>
<h3>Rules and Alerts</h3>
<p>We can set up rules and alerts to detect anomalies, errors, and performance issues in our scripts.</p>
<p>The <a href="https://www.elastic.co/guide/en/kibana/current/apm-alerts.html#apm-create-error-alert"><code>error count threshold</code> rule</a> triggers when the number of errors in a service exceeds a defined threshold.</p>
<p>To create the rule, go to Alerts and Insights -&gt; Rules -&gt; Create Rule -&gt; Error count threshold. Then set the error count threshold, the service or environment you want to monitor (you can also set an error grouping key across services), and how often to run the check, and choose a connector.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/error-count-threshold.png" alt="6 - ETL Status Error Rule" /></p>
<p>Next, we create a rule of type <code>custom threshold</code> on a given ETL logs <a href="https://www.elastic.co/guide/en/kibana/current/data-views.html">data view</a> (create one for your index) filtering on &quot;labels.status: error&quot; to get all the logs with status error from any of the steps of the ETL which have failed. The rule condition is set to document count &gt; 0. In our case, in the last section of the rule config, we also set up Slack <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">alerts</a> every time the rule is activated. You can pick from a long list of <a href="https://www.elastic.co/guide/en/kibana/current/action-types.html">connectors</a> Elastic supports.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/etl-fail-status-rule.png" alt="7 - ETL Status Error Rule" /></p>
<p>For this rule to work, we add the status to the logs metadata for each of the steps in the ETLs, as shown in the code sample below. It then becomes available in Elasticsearch via <code>labels.status</code>.</p>
<pre><code class="language-python">logging.info(
            &quot;Elasticsearch search operation successful&quot;,
            extra={
                &quot;event&quot;: &quot;Elasticsearch Search&quot;,
                &quot;status&quot;: &quot;success&quot;,
                &quot;category&quot;: category,
                &quot;index&quot;: index,
            },
        )
</code></pre>
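<p>The snippet above covers the success path. For the failure rule to fire, a failed step also has to log with <code>status</code> set to <code>error</code>. Here is a minimal sketch of a wrapper that handles both cases using only the standard <code>logging</code> module; <code>run_step</code> and its parameters are hypothetical names, not part of our ETL code.</p>
<pre><code class="language-python">import logging


def run_step(event, category, func, *args, **kwargs):
    """Run one ETL step and log its outcome with a status field."""
    fields = {'event': event, 'category': category}
    try:
        result = func(*args, **kwargs)
    except Exception:
        # logging.exception logs at ERROR level and attaches the traceback
        logging.exception(
            '%s operation failed', event, extra={**fields, 'status': 'error'}
        )
        raise
    logging.info(
        '%s operation successful', event, extra={**fields, 'status': 'success'}
    )
    return result
</code></pre>
<p>A step would then be called as <code>run_step('Elasticsearch Search', category, search_function)</code>, where <code>search_function</code> stands in for your own code. The exception is re-raised so that the script still fails visibly after the error has been logged.</p>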
<h3>More Rules</h3>
<p>We could also add rules to detect anomalies in the execution time of the different spans we define. This is done by selecting transaction/span -&gt; Alerts and rules -&gt; Custom threshold rule -&gt; Latency. In the example below, we want to generate an alert whenever the search step takes more than 25s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency.png" alt="8 - APM Custom Threshold - Latency" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_custom_threshold_latency_2.png" alt="9 - APM Custom Threshold - Config" /></p>
<p>Alternatively, for finer-grained control, you can go with Alerts and rules -&gt; Anomaly rule, set up an anomaly job, and pick a threshold severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/apm_anomaly_rule_config.png" alt="10 - APM Anomaly Rule - Config" /></p>
<h3>Anomaly detection job</h3>
<p>In this example, we set up an <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html">Anomaly Detection job</a> on the number of documents before the transform, using a <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-anomaly-detection-job-types.html#multi-metric-jobs">Single metric job</a> to detect any anomalies in the incoming data source.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/single-metrics.png" alt="11 - Single Metrics" /></p>
<p>In the last step, you can create an alert rule, as we did before, to be notified whenever an anomaly is detected, by setting a severity level threshold. Every anomaly is assigned an anomaly score, which determines its severity level.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-1.png" alt="12 - Anomaly detection Alerting - Severity" /></p>
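<p>As an illustration of how an anomaly score maps to a severity level, here is a minimal sketch. The band boundaries (25, 50, 75) reflect our understanding of the severity levels in Elastic ML; treat them as an assumption and check the documentation for your version. The function name <code>severity_for_score</code> is hypothetical.</p>
<pre><code class="language-python">from bisect import bisect_right

# Assumed bands: warning below 25, minor from 25, major from 50, critical from 75
THRESHOLDS = [25, 50, 75]
SEVERITIES = ['warning', 'minor', 'major', 'critical']


def severity_for_score(score):
    """Map an anomaly score in the range 0-100 to a severity label."""
    return SEVERITIES[bisect_right(THRESHOLDS, score)]


print(severity_for_score(10), severity_for_score(60), severity_for_score(75))
</code></pre>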
<p>Similarly to the previous example, we set up a Slack connector to receive alerts whenever an anomaly is detected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/anomaly-detection-alerting-connectors.png" alt="13 - Anomaly detection Alerting - Connectors" /></p>
<p>You can add the anomaly results to your custom dashboard by going to Add Panel -&gt; ML -&gt; Anomaly Swim Lane -&gt; Pick your job.</p>
<p>We also add a job for the number of documents after the transform, and a Multi-Metric job on <code>execution_time_ms</code>, <code>bytes_processed</code> and <code>rows_affected</code>, as described in <a href="https://www.elastic.co/observability-labs/blog/monitor-dbt-pipelines-with-elastic-observability">Monitor dbt pipelines with Elastic Observability</a>.</p>
<h2>Custom Dashboard</h2>
<p>Now that your logs, metrics, and traces are in Elastic, you can use the full potential of our Kibana dashboards to extract the most from them. We can create a custom dashboard like the following one: a pie chart based on <code>labels.event</code> (category field for every type of step in the ETLs), a chart for every type of step broken down by status, a timeline of steps broken down by status, BQ stats for the ETL, and anomaly detection swim lane panels for the various anomaly jobs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/custom_dashboard.png" alt="14 - Custom Dashboard" /></p>
<h2>Conclusion</h2>
<p>Elastic’s APM, in combination with other Observability and ML features, provides a unified view of our data pipelines, allowing us to bring a lot of value with minimal code changes:</p>
<ul>
<li>Capture logs alongside their execution context, with no need to add custom logging</li>
<li>Monitor the runtime behavior of our models</li>
<li>Track data quality issues</li>
<li>Identify and troubleshoot real-time incidents</li>
<li>Optimize performance bottlenecks and resource usage</li>
<li>Identify dependencies on other services and their latency</li>
<li>Optimize data transformation processes</li>
<li>Set up alerts on latency, data quality issues, error rates of transactions, or CPU usage</li>
</ul>
<p>With these capabilities, we can ensure the resilience and reliability of our data pipelines, leading to a more robust and accurate BI system and reporting.</p>
<p>In conclusion, setting up OpenTelemetry (OTEL) in Python for data pipeline observability has significantly improved our ability to monitor, detect, and resolve issues proactively. This has led to more reliable data transformations, better resource management, and enhanced overall performance of our data transformation, BI and Machine Learning systems.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-your-python-data-pipelines-with-otel/main_image.jpg" length="0" type="image/jpg"/>
        </item>
    </channel>
</rss>