When TSDS meets ILM: Designing time series data streams that don't reject late data

How TSDS time bounds interact with ILM phases; and how to design policies that tolerate late-arriving metrics.


Recently, I migrated a customer's metrics cluster from "everything in the hot tier" to a hot/cold/frozen architecture. It was a change I'd performed dozens of times before. Within minutes, Logstash stopped moving data entirely.

Elasticsearch was rejecting late-arriving metrics. Those rejections caused the pipeline to fall behind, resulting in more late data, which triggered even more rejections. Eventually, the pipeline stalled completely.

We had to restore from snapshot, reindex the data, and redesign the ingestion pipeline to recover.

The root cause wasn't index lifecycle management (ILM) itself. It was time series data streams (TSDS) and how they enforce time‑bound backing indices.

TSDS can reduce storage requirements for metrics by 40–70%, but the architectural changes that make TSDS efficient also alter how indices behave over time. Those changes matter when designing ILM policies or when your ingestion pipelines may produce late‑arriving data.

TL;DR

When using TSDS:

  • Backing indices only accept documents within a specific time window.
  • If late data arrives after an index moves to cold or frozen, Elasticsearch rejects those documents or routes them to the failure store, if configured.

Design rule: keep backing indices writable for at least as long as your maximum expected data lateness after rollover.

What is a time series data stream?

A time series data stream (TSDS) is a specialized data stream optimized for metrics data. Data is routed so that documents from the same time series land in the same shards, which optimizes storage, query, and retrieval. Here's how Elasticsearch does it:

Each document contains:

  • A timestamp.
  • Dimension fields identifying the time series.
  • Metric fields representing measured values.

Examples include:

  • CPU usage per host.
  • Request latency per service.
  • Temperature readings per sensor.

Dimensions identify what we want to measure, while metrics represent values that change over time.
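For example, a CPU metric document might look like this (field names are illustrative): `host.name` and `region` are dimensions, and `system.cpu.usage` is the metric.

```json
{
  "@timestamp": "2025-06-01T12:00:00.000Z",
  "host": { "name": "web-01" },
  "region": "eu-west-1",
  "system": { "cpu": { "usage": 0.42 } }
}
```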

Dimensions

Dimensions describe the measured entity.

Examples include host names, service names, sensor IDs, and regions.

We define them in mappings with the `time_series_dimension` mapping parameter.
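A sketch of a dimension mapping (the `host.name` field is illustrative; `time_series_dimension` is the actual mapping parameter):

```json
{
  "mappings": {
    "properties": {
      "host": {
        "properties": {
          "name": {
            "type": "keyword",
            "time_series_dimension": true
          }
        }
      }
    }
  }
}
```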

Metrics

Metrics represent numeric values and are defined in mappings using the `time_series_metric` parameter.

Common metric types:

  • Gauge: Values that rise and fall.
  • Counter: Values that increase until reset.
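In a mapping, both types are declared with the `time_series_metric` parameter; a sketch with illustrative field names:

```json
{
  "mappings": {
    "properties": {
      "system.cpu.usage": {
        "type": "double",
        "time_series_metric": "gauge"
      },
      "system.network.in.bytes": {
        "type": "long",
        "time_series_metric": "counter"
      }
    }
  }
}
```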

Elastic Agent primarily collects metrics and logs data, so even if you haven’t enabled any TSDS indices by hand, you may still have them in your cluster.

The _tsid field

Elasticsearch internally generates a _tsid value from dimension fields. This allows documents with identical dimensions to be routed to the same shard, improving:

  • Compression.
  • Query locality.
  • Aggregation performance.

The key difference: Time‑bound backing indices

Traditional data streams always write to the most recent backing index, called the write index, but TSDS behaves differently.

Each TSDS backing index has a defined time window and only accepts documents with @timestamp values that fall in that window:
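The window is defined by two index settings, set automatically when the backing index is created (the values here are illustrative):

```json
{
  "index": {
    "mode": "time_series",
    "routing_path": ["host.name"],
    "time_series": {
      "start_time": "2025-06-01T00:00:00.000Z",
      "end_time": "2025-06-02T00:00:00.000Z"
    }
  }
}
```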

When a document is indexed, Elasticsearch routes it to the backing index responsible for that timestamp, meaning that, unlike traditional indices, a TSDS may write to multiple backing indices simultaneously.

For example:

  • Real‑time data → newest index.
  • Late data → earlier index covering that time range.

Designing for late‑arriving data

Real ingestion pipelines rarely deliver metrics perfectly on time. Metrics can be delayed by network outages, pipeline backlogs, batch ingestion, and edge devices that drop offline and then reconnect and catch up.

Traditional indices quietly absorb those delays. TSDS does not.

If a document's timestamp falls outside the range of writable backing indices, Elasticsearch rejects it, meaning your ILM policy must account for late data.

The critical constraint

Backing indices must remain writable long enough to accept delayed data.

In practical terms: don't move an index out of the hot tier, or otherwise write-block it, until the latest data it can contain could no longer arrive.

Because ILM measures phase ages from rollover, the operational rule becomes: the `min_age` of the first write-blocking phase must exceed your maximum expected ingestion lateness.

For example, if metrics may arrive up to six hours late, indices must remain writable at least six hours after rollover.

Failing to account for this constraint was exactly what caused the ingestion failure described earlier. Late-arriving data was directed to an earlier index, which was already in the cold tier and therefore write-blocked.

Handling rejected documents

When TSDS rejects a document, Elasticsearch returns an error, indicating that the timestamp doesn’t fall within the range of writable indices. How your ingestion pipeline handles that error determines whether you lose data or stall ingestion.

The primary mechanism for handling rejected documents is the failure store.

Failure store (recommended in Elasticsearch 9.1+)

Elasticsearch 9.1 introduced the failure store, which automatically captures rejected documents. Instead of returning errors to clients, Elasticsearch writes failed documents to a dedicated failure index inside the data stream.

You can inspect failures using:
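The exact API depends on your version; in 9.1, a data stream's failure store is addressable with the `::failures` selector (the data stream name here is illustrative):

```json
GET metrics-custom.app-default::failures/_search
{
  "query": { "match_all": {} }
}
```

Each failure document records the original source along with the error that caused the rejection, so late data can be analyzed and reindexed later.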

Using the failure store prevents ingestion pipelines from choking on rejection errors while preserving failed data for analysis or reindexing.

Monitoring for rejection issues

Late‑arrival problems usually surface as ingestion anomalies. You may notice:

  • Sudden drops in indexing rate.
  • Spikes in rejected documents.
  • A growing number of failure store entries.
  • Mismatches between pipeline input and output counts.

Alerting on these signals allows operators to detect issues before pipelines stall. Alerting rules, machine learning jobs, and other mechanisms can automate detection and notification.

Migration checklist for TSDS + ILM

If you're migrating a metrics cluster to TSDS, introducing ILM tiering, or upgrading to an Elasticsearch version where metrics are TSDS by default, review these items first.

1. Measure ingestion latency

Before changing ILM policies, determine:

  • Normal ingestion delay.
  • Worst-case delay during incidents.
  • Delays caused by batch pipelines.

Your ILM design must accommodate the maximum realistic delay.

2. Verify index time windows

Inspect your TSDS backing indices:
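One way to do this is to query the settings of a backing index directly (the index name here is illustrative):

```json
GET .ds-metrics-custom.app-default-2025.06.01-000001/_settings?filter_path=*.settings.index.time_series
```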

Look for:

  • time_series.start_time
  • time_series.end_time

These bounds determine which indices can accept documents. Understanding these windows can help you determine how late data can be before it’s rejected.

3. Size the hot tier for late arrivals

Ensure backing indices remain writable long enough for delayed data.

Operational rule:

  • warm phase min_age > maximum_expected_lateness (ILM measures min_age from rollover)

Remember, indices must remain writable for at least six hours if metrics may arrive six hours late.
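Applied to the six-hour example, an ILM policy sketch might keep indices writable for eight hours after rollover before tiering them down (all phase timings, sizes, and the repository name are illustrative; adjust to your measured lateness):

```json
PUT _ilm/policy/tsds-metrics
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "8h",
        "actions": {}
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "my-repo" }
        }
      }
    }
  }
}
```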

4. Decide how to handle rejected documents

Choose a strategy before enabling TSDS:

  • Failure store (recommended in Elasticsearch 9.1+).
  • Logstash dead letter queue.
  • Fallback index for late arrivals.
  • Accepting limited data loss.

5. Monitor ingestion health

Add alerts for:

  • Indexing rate drops.
  • Rejected documents.
  • Failure store growth.
  • Pipeline input/output mismatches.

Late data issues often appear first as ingestion anomalies.

Summary

Time series data streams provide major storage and performance improvements for metrics workloads, but they introduce an important architectural change: Backing indices are time‑bound, which affects how ILM behaves.

When using TSDS:

  • Indices must remain writable long enough to accept delayed data.
  • Ingestion pipelines should handle rejected documents safely.

The key rule to remember is: backing indices must remain writable for at least your maximum expected data lateness after rollover.

If you design ILM policies around that constraint, TSDS works extremely well for metrics workloads.

Ignore it, though, and your ingestion pipeline may discover those time boundaries the hard way.
