Elastic Observability Labs - Articles by Nicolas Ruflin

How to use Elasticsearch and Time Series Data Streams for observability metrics

Thu, 04 May 2023 00:00:00 GMT

Elasticsearch is used for a wide variety of data types, and metrics are one of them. With the introduction of Metricbeat many years ago and later our APM Agents, the metrics use case has become more popular. Over the years, Elasticsearch has made many improvements in how it handles things like metrics aggregations and sparse documents. At the same time, TSVB visualizations were introduced to make visualizing metrics easier. One concept that was still missing, although it exists in most other metrics solutions, is that of time series with dimensions.

In mid-2021, the Elasticsearch team embarked on making Elasticsearch a much better fit for metrics. The team created Time Series Data Streams (TSDS), which were released as generally available (GA) in 8.7.

This blog post dives into how TSDS works and how we use it in Elastic Observability, as well as how you can use it for your own metrics.

A quick introduction to TSDS

Time Series Data Streams (TSDS) are built on top of Elasticsearch data streams and are optimized for time series. To create a data stream for metrics, one additional setting on the data stream is needed. Because we are using data streams, an index template has to be created first:

PUT _index_template/metrics-laptop
{
  "index_patterns": [
    "metrics-laptop-*"
  ],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.mode": "time_series"
    },
    "mappings": {
      "properties": {
        "host.name": {
          "type": "keyword",
          "time_series_dimension": true
        },
        "packages.sent": {
          "type": "integer",
          "time_series_metric": "counter"
        },
        "memory.usage": {
          "type": "double",
          "time_series_metric": "gauge"
        }
      }
    }
  }
}

Let's have a closer look at this template. At the top, the index pattern is set to metrics-laptop-*. Any pattern can be selected, but it is recommended to use the data stream naming scheme for all your metrics. The next section sets "index.mode": "time_series" and, with "data_stream": {}, makes sure that a data stream is created.

Dimensions

Each time series data stream needs at least one dimension. In the example above, host.name is set as a dimension field with "time_series_dimension": true. You can have up to 16 dimensions by default. Not every dimension must show up in each document. The dimensions define the time series. The general rule is to pick fields as dimensions that uniquely identify your time series. Often this is a unique description of the host/container, but for some metrics like disk metrics, the disk id is needed in addition. If you are curious about default recommended dimensions, have a look at this ECS contribution with dimension properties.
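To make the idea of "dimensions define the time series" more concrete, here is a minimal Python sketch that groups documents by their dimension values. It is only an illustration of the concept, not how Elasticsearch implements it. The sample documents and the second dimension disk.id are made up for this example.

```python
from collections import defaultdict

# Dimension fields; "disk.id" is a hypothetical second dimension for illustration.
DIMENSIONS = ["host.name", "disk.id"]

# Made-up sample documents. Not every dimension shows up in each document.
docs = [
    {"@timestamp": "2023-03-30T12:26:00Z", "host.name": "a", "disk.id": "sda", "memory.usage": 0.8},
    {"@timestamp": "2023-03-30T12:26:00Z", "host.name": "a", "disk.id": "sdb", "memory.usage": 0.7},
    {"@timestamp": "2023-03-30T12:26:00Z", "host.name": "b", "memory.usage": 0.5},
    {"@timestamp": "2023-03-30T12:27:00Z", "host.name": "a", "disk.id": "sda", "memory.usage": 0.9},
]


def series_key(doc, dimensions=DIMENSIONS):
    """Return the (dimension, value) pairs present in the document."""
    return tuple((dim, doc[dim]) for dim in dimensions if dim in doc)


def group_by_series(documents):
    """Group documents so that each group is one time series."""
    groups = defaultdict(list)
    for doc in documents:
        groups[series_key(doc)].append(doc)
    return dict(groups)


if __name__ == "__main__":
    for key, members in group_by_series(docs).items():
        print(key, len(members))
```

Each distinct combination of dimension values ends up as its own time series, which is why the dimensions should uniquely identify the thing being measured.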

Reduced storage and increased query speed

At this point, you already have a functioning time series data stream. Setting the index mode to time series automatically turns on synthetic source. By default, Elasticsearch typically duplicates data three times:

row-oriented storage (_source field)
column-oriented storage (doc_values: true for aggregations)
indices (index: true for filtering and search)

With synthetic source, the _source field is not persisted; instead, it is reconstructed from the doc values. Especially in the metrics use case, there is little benefit to keeping the source.

Not storing it means a significant reduction in storage. Time series data streams sort the data based on the dimensions and the time stamp. This means data that is usually queried together is stored together, which speeds up query times. It also means that the data points for a single time series are stored alongside each other on disk. This enables further compression of the data as the rate at which a counter increases is often relatively constant.
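The effect of sorting by dimensions and timestamp can be sketched in a few lines of Python. This is a simplified illustration with made-up values. The ascending timestamp order and the plain delta calculation are assumptions for the example and do not describe the actual on-disk encoding in Elasticsearch.

```python
# Made-up documents from two hosts, in the order they arrive.
arrival_order = [
    {"@timestamp": "2023-03-30T12:26:00Z", "host.name": "a", "packages.sent": 1000},
    {"@timestamp": "2023-03-30T12:26:00Z", "host.name": "b", "packages.sent": 50},
    {"@timestamp": "2023-03-30T12:27:00Z", "host.name": "a", "packages.sent": 1010},
    {"@timestamp": "2023-03-30T12:27:00Z", "host.name": "b", "packages.sent": 60},
    {"@timestamp": "2023-03-30T12:28:00Z", "host.name": "a", "packages.sent": 1020},
    {"@timestamp": "2023-03-30T12:28:00Z", "host.name": "b", "packages.sent": 70},
]


def sort_by_series_and_time(docs):
    """Sort by the dimension first, then by timestamp (assumed ascending here)."""
    return sorted(docs, key=lambda d: (d["host.name"], d["@timestamp"]))


def counter_column(docs):
    """Extract the counter values in storage order."""
    return [d["packages.sent"] for d in docs]


def deltas(values):
    """Differences between neighbouring values."""
    return [b - a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    print(deltas(counter_column(arrival_order)))
    print(deltas(counter_column(sort_by_series_and_time(arrival_order))))
```

In arrival order, neighbouring values jump between the two hosts, so the differences are large. Once the documents are sorted, almost all differences are the same small number, which is the kind of regularity that compresses well.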

Metric types

To benefit from all the advantages of TSDS, the field properties of the metric fields must be extended with time_series_metric: {type}. Several types are supported; gauge and counter were used in the example above. Knowing the metric type allows Elasticsearch to offer more optimized queries for the different types and to reduce storage usage further.
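Why the metric type matters becomes clearer when looking at how the two types are usually read. The following Python sketch is a conceptual illustration only, not Elasticsearch's query implementation. Treating a drop in a counter as a reset is a common convention and an assumption here.

```python
def counter_increase(values):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # The counter was reset; count the new value from zero.
            total += cur
    return total


def gauge_average(values):
    """A gauge is a point-in-time value, so averaging it is meaningful."""
    return sum(values) / len(values)


if __name__ == "__main__":
    print(counter_increase([1000, 1010, 1020, 5, 15]))
    print(gauge_average([0.5, 0.75, 1.0]))
```

For a counter such as packages.sent, the interesting number is how much it grew. For a gauge such as memory.usage, the values themselves are what you aggregate.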

When you create your own templates for data streams under the data stream naming scheme, it is important that you set "priority": 200 or higher, as otherwise the built-in default template will apply.
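To illustrate why the priority matters, here is a minimal Python sketch of the selection rule: among all templates whose index pattern matches, the one with the highest priority is used. The name, pattern, and priority of 100 for the built-in template are assumptions made for this example and are not taken from this post. The pattern matching with fnmatch is also a simplification.

```python
from fnmatch import fnmatchcase

# Assumed values for the built-in template, for illustration only.
BUILT_IN = {"name": "metrics", "index_patterns": ["metrics-*-*"], "priority": 100}


def pick_template(data_stream, templates):
    """Return the name of the matching template with the highest priority."""
    matching = [
        t for t in templates
        if any(fnmatchcase(data_stream, p) for p in t["index_patterns"])
    ]
    if not matching:
        return None
    return max(matching, key=lambda t: t["priority"])["name"]


if __name__ == "__main__":
    custom = {"name": "metrics-laptop", "index_patterns": ["metrics-laptop-*"], "priority": 200}
    print(pick_template("metrics-laptop-default", [BUILT_IN, custom]))
```

With a priority below that of the built-in template, the custom template would lose and the time series settings would not be applied.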

Ingest a document

Ingesting a document into a TSDS isn't in any way different from ingesting documents into Elasticsearch. You can use the following commands in Dev Tools to add a document, and then search for it and also check out the mappings. Note: You have to adjust the @timestamp field to be close to your current date and time.

# Add a document with `host.name` as the dimension
POST metrics-laptop-default/_doc
{
  # This timestamp needs to be adjusted to be current
  "@timestamp": "2023-03-30T12:26:23+00:00",
  "host.name": "ruflin.com",
  "packages.sent": 1000,
  "memory.usage": 0.8
}

# Search for the added doc, _source will show up but is reconstructed
GET metrics-laptop-default/_search

# Check out the mappings
GET metrics-laptop-default

If you run the search, the response still shows _source, but it is reconstructed from the doc values. The document above also contains @timestamp, a field that is not defined in the template. This is important, as it is a required field for any data stream.
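Since the timestamp has to be close to the current time, it can be handy to generate the request body instead of editing it by hand. The following Python sketch prints a body in the same shape as the document above. The helper name build_metric_doc is made up for this example.

```python
import json
from datetime import datetime, timezone


def build_metric_doc(host_name, packages_sent, memory_usage, now=None):
    """Build a body like the one above, stamped with the current UTC time."""
    now = now or datetime.now(timezone.utc)
    return {
        "@timestamp": now.isoformat(timespec="seconds"),
        "host.name": host_name,
        "packages.sent": packages_sent,
        "memory.usage": memory_usage,
    }


if __name__ == "__main__":
    # Paste the printed JSON as the body of POST metrics-laptop-default/_doc
    print(json.dumps(build_metric_doc("ruflin.com", 1000, 0.8), indent=2))
```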

Why is this all important for Observability?

One of the advantages of the Elastic Observability solution is that all signals are brought together in a single storage engine. Users can query logs, metrics, and traces together without having to jump from one system to another. Because of this, having a great storage and query engine not only for logs but also for metrics is key for us.

Usage of TSDS in integrations

With integrations, we give our users an out-of-the-box experience to integrate with their infrastructure and services. If you are using our integrations, you will eventually get all the benefits of TSDS for your metrics automatically, assuming you are on version 8.7 or newer.

We are currently working through the list of our integration packages, adding the dimensions and metric type fields and then turning on TSDS for the metrics data streams. This means that as soon as a package has all properties enabled, the only thing you have to do is upgrade the integration; everything else happens automatically in the background.

To visualize your time series in Kibana, use Lens, which has native support built in for TSDS.

Learn more

If you switch over to TSDS, you will automatically benefit from all the future improvements Elasticsearch is making for metrics time series, be it more efficient storage, query performance, or new aggregation capabilities. If you want to learn more about how TSDS works under the hood and all available config options, check out the TSDS documentation. What Elasticsearch supports in 8.7 is only the first iteration of the metrics time series in Elasticsearch.

TSDS can be used since 8.7 and will be in more and more of our integrations automatically when integrations are upgraded. All you will notice is lower storage usage and faster queries. Enjoy!

Simplifying log data management: Harness the power of flexible routing with Elastic

Tue, 13 Jun 2023 00:00:00 GMT

In Elasticsearch 8.8, we’re introducing the reroute processor in technical preview that makes it possible to send documents, such as logs, to different data streams, according to flexible routing rules. When using Elastic Observability, this gives you more granular control over your data with regard to retention, permissions, and processing with all the potential benefits of the data stream naming scheme. While optimized for data streams, the reroute processor also works with classic indices. This blog post contains examples on how to use the reroute processor that you can try on your own by executing the snippets in the Kibana dev tools.

Elastic Observability offers a wide range of integrations that help you to monitor your applications and infrastructure. These integrations are added as policies to Elastic agents, which help ingest telemetry into Elastic Observability. Several examples of these integrations include the ability to ingest logs from systems that send a stream of logs from different applications, such as Amazon Kinesis Data Firehose, Kubernetes container logs, and syslog. One challenge is that these multiplexed log streams are sending data to the same Elasticsearch data stream, such as logs-syslog-default. This makes it difficult to create parsing rules in ingest pipelines and dashboards for specific technologies, such as the ones from the Nginx and Apache integrations. That’s because in Elasticsearch, in combination with the data stream naming scheme, the processing and the schema are both encapsulated in a data stream.

The reroute processor helps you tease apart data from a generic data stream and send it to a more specific one. You may use that mechanism to send logs to a data stream that is set up by the Nginx integration, for example, so that the logs are parsed with that integration and you can use the integration’s prebuilt dashboards or create custom ones with the fields, such as the url, the status code, and the response time that the Nginx pipeline has parsed out of the Nginx log message. You can also split out/separate regular Nginx logs and errors with the reroute processor, providing further separation ability and categorization of logs.

Example use case

To use the reroute processor, first:

Ensure you are on Elasticsearch 8.8
Ensure you have permissions to manage indices and data streams
If you don’t already have an account on Elastic Cloud, sign up for one

Next, you’ll need to set up a data stream and create a custom Elasticsearch ingest pipeline that is called as the default pipeline. Below, we go through this step by step for the “mydata” dataset, into which we’ll simulate ingesting container logs. We start with a basic example and extend it from there.

The following steps should be run in the Elastic console, which is found at Management -> Dev tools -> Console. First, we need an ingest pipeline and a template for the data stream:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
      }
    }
  ]
}

This creates an ingest pipeline with an empty reroute processor. To make use of it, we need an index template:

PUT _index_template/logs-mydata
{
  "index_patterns": [
    "logs-mydata-*"
  ],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.default_pipeline": "logs-mydata"
    },
    "mappings": {
      "properties": {
        "container.name": {
          "type": "keyword"
        }
      }
    }
  }
}

The above template is applied to all data that is shipped to logs-mydata-*. We have mapped container.name as a keyword, as this is the field we will be using for routing later on. Now, we send a document to the data stream and it will be ingested into logs-mydata-default:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo"
  }
}

We can check that it was ingested with the command below, which will show 1 result.

GET logs-mydata-default/_search

Even with an empty reroute processor, this already allows us to route documents. As soon as the reroute processor is specified, it looks for the data_stream.dataset and data_stream.namespace fields by default and sends documents to the corresponding data stream, according to the data stream naming scheme logs-<dataset>-<namespace>. Let’s try this out:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-03-30T12:27:23+00:00",
  "container": {
    "name": "foo"
  },
  "data_stream": {
    "dataset": "myotherdata"
  }
}
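The default behavior of the empty reroute processor can be sketched in Python as follows. This is a simplified model for illustration, not the processor's implementation. It assumes that the parts of the current data stream name are used as a fallback when a data_stream.* field is missing, which matches the first document staying in logs-mydata-default.

```python
def resolve_target(index_name, doc):
    """Resolve <type>-<dataset>-<namespace> from the document, with fallbacks."""
    # Split the current data stream name into its three parts.
    ds_type, dataset, namespace = index_name.split("-", 2)
    data_stream = doc.get("data_stream", {})
    # Prefer the data_stream.* fields; otherwise keep the current parts.
    dataset = data_stream.get("dataset", dataset)
    namespace = data_stream.get("namespace", namespace)
    return f"{ds_type}-{dataset}-{namespace}"


if __name__ == "__main__":
    first = {"container": {"name": "foo"}}
    second = {"container": {"name": "foo"}, "data_stream": {"dataset": "myotherdata"}}
    print(resolve_target("logs-mydata-default", first))
    print(resolve_target("logs-mydata-default", second))
```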

Running GET logs-mydata-default/_search again still returns only the first document, because the second one ended up in the logs-myotherdata-default data stream (you can check this with GET logs-myotherdata-default/_search). But instead of using the default rules, we want to create our own rule for the field container.name. If container.name is foo, we want to send the document to logs-foo-default. For this, we modify our routing pipeline:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
        "tag": "foo",
        "if" : "ctx.container?.name == 'foo'",
        "dataset": "foo"
      }
    }
  ]
}

Let's test this with a document:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo"
  }
}

While it would be possible to specify a routing rule for each container name, you can also route by the value of a field in the document:

PUT _ingest/pipeline/logs-mydata
{
  "description": "Routing for mydata",
  "processors": [
    {
      "reroute": {
        "tag": "mydata",
        "dataset": [
          "{{container.name}}",
          "mydata"
        ]
      }
    }
  ]
}

In this example, we are using a field reference as a routing rule. If the container.name field exists in the document, the document is routed to the dataset named after its value; otherwise, the dataset falls back to mydata. This can be tested with:

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo1"
  }
}

POST logs-mydata-default/_doc
{
  "@timestamp": "2023-05-25T12:26:23+00:00",
  "container": {
    "name": "foo2"
  }
}

This creates the data streams logs-foo1-default and logs-foo2-default.
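The fallback list used in the pipeline above can be modeled with a short Python sketch: each entry is tried in order, a {{field.reference}} is looked up in the nested document, and the first entry that resolves wins. This is a simplified illustration and not the processor's code. It leaves out details such as how the real processor sanitizes values.

```python
import re

# Matches a whole-string field reference such as {{container.name}}.
FIELD_REF = re.compile(r"^\{\{(.+)\}\}$")


def lookup(doc, path):
    """Walk nested objects, e.g. 'container.name' -> doc['container']['name']."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current if isinstance(current, str) else None


def resolve_dataset(candidates, doc):
    """Return the first candidate that resolves to a value."""
    for candidate in candidates:
        match = FIELD_REF.match(candidate)
        if match is None:
            return candidate  # static value, always resolves
        value = lookup(doc, match.group(1))
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    rule = ["{{container.name}}", "mydata"]
    print(resolve_dataset(rule, {"container": {"name": "foo1"}}))
    print(resolve_dataset(rule, {"message": "no container field"}))
```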

NOTE: There is currently a limitation in the processor that requires the fields specified in a {{field.reference}} to be in a nested object notation. A dotted field name does not currently work. Also, you’ll get errors when the document contains dotted field names for any data_stream.* field. This limitation will be fixed in 8.8.2 and 8.9.0.

API keys

When using the reroute processor, it is important that the API keys specified have permissions for the source and target indices. For example, if a pattern is used for routing from logs-mydata-default, the API key must have write permissions for logs-*-* as data could end up in any of these indices (see example further down).
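A quick way to reason about this is to check whether every possible target name is covered by the patterns the API key grants. The Python sketch below does this with simple wildcard matching. It is only a thinking aid: the function name is made up, and Elasticsearch's own privilege checks are more involved than fnmatch.

```python
from fnmatch import fnmatchcase


def is_covered(target, granted_patterns):
    """True if the target data stream matches one of the granted patterns."""
    return any(fnmatchcase(target, pattern) for pattern in granted_patterns)


if __name__ == "__main__":
    targets = ["logs-mydata-default", "logs-foo1-default", "logs-foo2-default"]
    for granted in (["logs-mydata-*"], ["logs-*-*"]):
        missing = [t for t in targets if not is_covered(t, granted)]
        print(granted, "missing:", missing)
```

A key that is limited to logs-mydata-* would cover the source data stream but not the rerouted targets.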

We’re currently working on extending the API key permissions for our integrations so that they allow for routing by default if you’re running a Fleet-managed Elastic Agent.

If you’re using a standalone Elastic Agent, or any other shipper, you can use this as a template to create your API key:

POST /_security/api_key
{
  "name": "ingest_logs",
  "role_descriptors": {
    "ingest_logs": {
      "cluster": [
        "monitor"
      ],
      "indices": [
        {
          "names": [
            "logs-*-*"
          ],
          "privileges": [
            "auto_configure",
            "create_doc"
          ]
        }
      ]
    }
  }
}

Future plans

In Elasticsearch 8.8, the reroute processor was released in technical preview. The plan is to adopt this in our data sink integrations like syslog, k8s, and others. Elastic will provide default routing rules that just work out of the box, but it will also be possible for users to add their own rules. If you are using our integrations, follow this guide on how to add a custom ingest pipeline.

Try it out!

This blog post has shown some sample use cases for document-based routing. Try it out on your data by adjusting the commands for index templates and ingest pipelines to your own data, and get started with Elastic Cloud through a 7-day free trial. Let us know via this feedback form how you’re planning to use the reroute processor and whether you have suggestions for improvement.