Troubleshoot timestamp data quality issues

In Elasticsearch date fields, you can use any valid past, present, or future date as the value, as long as it matches the field's date format.

Make sure your stored date fields are valid. When ensuring timestamp accuracy, you might encounter these common client-side data quality issues:

  • Operating system timezone setting conflicts
  • Incorrectly formatted date values
  • Unexpected values for the specific date field, for example:
    • future dates that don't make sense for the data
    • default dates, such as the 1970 Unix epoch
    • negative dates
    • time-bucketed dates
    • truncated strings
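
As a rough illustration of the categories above, the following sketch labels individual timestamp strings before they are indexed. It is a client-side check under stated assumptions, not an Elasticsearch feature: the function name `classify_timestamp`, the accepted formats, and the five-minute tolerance for clock drift are all hypothetical choices, and time-bucketed dates are not detected.

```python
from datetime import datetime, timedelta, timezone

# Assumption: values are ISO 8601 strings with seconds and a UTC offset,
# with or without fractional seconds. Adjust to your field's date format.
FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def classify_timestamp(value, now, max_future=timedelta(minutes=5)):
    """Label a timestamp string as ok, unparseable, negative, epoch_default, or future."""
    parsed = None
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
            break
        except ValueError:
            continue
    if parsed is None:
        return "unparseable"  # covers truncated or incorrectly formatted strings
    if parsed < EPOCH:
        return "negative"  # earlier than the Unix epoch
    if parsed == EPOCH:
        return "epoch_default"  # likely an unset value that defaulted to 0
    if parsed - now > max_future:
        return "future"  # further ahead than the tolerated clock drift
    return "ok"


# Fixed reference time so the example is reproducible.
now = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
for sample in (
    "2024-06-01T11:59:00.250Z",
    "2025-06-01T12:00:00Z",
    "1970-01-01T00:00:00Z",
    "1969-12-31T23:59:59Z",
    "2024-06-01T11:5",
):
    print(sample, "->", classify_timestamp(sample, now))
```

Whether a negative or far-future date is actually wrong depends on the data, so treat these labels as prompts for review rather than as validation rules.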

This page summarizes the symptoms of these issues and helps you address them, focusing on unexpected future dates in the @timestamp field.

Tip

When possible, make sure to resolve timestamp data quality issues in the data itself, rather than working around them in Elasticsearch.

Timestamp data quality issues can strain resources and cause unexpected search results. They can especially affect the following features:

Common symptoms of these issues include:

Tip

You can reduce the performance impact of these issues even when the underlying timestamp data quality problems still exist.

Here's an example that illustrates how timestamp data quality issues can affect the search performance of your cluster. Suppose a single on-prem host has a misconfigured system clock, causing its @timestamp field to log timestamps one year in the future. In this example:

After 30 days, there are 150 shards, half of them hosted in the frozen tier.

If the misconfigured host didn't ingest data during the 30-day period, the search behaves as expected when a user selects a now-15m time window in the dashboard:

  1. The search pre-filter shows only the five shards from the latest backing index.
  2. Data is read from the latest five shards only, and the remaining search queries and aggregations run on that subset of data.
  3. Inspect reports 5 shards searched and 145 shards skipped.

If the misconfigured host was ingesting data during the 30-day period, the following issues can occur:

  1. The pre-filter can't filter out backing indices, so it allows searches against all 150 shards backing this data stream.
  2. Half of the shards are in the data_frozen data tier, which is intended for rarely queried data. The frozen tier is usually provisioned with low CPU relative to high data volume, so searches run slower. Additionally, in the frozen tier, indices are partially mounted searchable snapshots, which slows down searches because data must be fetched from the snapshot repository.
  3. The cluster searches all 150 shards for timestamps within the desired window. This resource usage can happen even if no documents end up matching the selected time range.
  4. Inspect reports 150 shards searched and 0 skipped.
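
The two scenarios above can be modeled with a simplified sketch of range-based shard skipping. This is not Elasticsearch's implementation. It assumes the data stream rolls over daily into backing indices with five shards each (an assumption that is consistent with 150 shards after 30 days), and that the misconfigured host writes at least one document per day dated one year ahead. The names `Shard`, `can_match`, `build_shards`, and `count_searched` are hypothetical.

```python
from dataclasses import dataclass

DAY_MS = 86_400_000  # milliseconds per day


@dataclass
class Shard:
    index: str
    min_ts: int  # smallest @timestamp in the shard, epoch milliseconds
    max_ts: int  # largest @timestamp in the shard, epoch milliseconds


def can_match(shard, gte, lte):
    """A shard can hold hits only if its timestamp range overlaps the query range."""
    return shard.min_ts <= lte and shard.max_ts >= gte


def build_shards(days=30, shards_per_index=5, skew_ms=0):
    """One backing index per day; skew_ms stretches each shard's maximum timestamp."""
    shards = []
    for day in range(days):
        for _ in range(shards_per_index):
            shards.append(
                Shard(
                    index=f"backing-index-{day:02d}",
                    min_ts=day * DAY_MS,
                    max_ts=(day + 1) * DAY_MS - 1 + skew_ms,
                )
            )
    return shards


def count_searched(shards, gte, lte):
    """Return (searched, skipped) shard counts for a time-range query."""
    searched = sum(1 for shard in shards if can_match(shard, gte, lte))
    return searched, len(shards) - searched


now_ms = 30 * DAY_MS - 1          # "now" is the end of day 30
gte, lte = now_ms - 15 * 60 * 1000, now_ms  # the now-15m window

print(count_searched(build_shards(), gte, lte))                       # (5, 145)
print(count_searched(build_shards(skew_ms=365 * DAY_MS), gte, lte))   # (150, 0)
```

In this model, a single future-dated document per backing index is enough to widen every shard's range so that it overlaps the query window, which is why no shard can be skipped.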

In the second scenario, search is more computationally expensive and can return incorrect results because of the host's misconfigured timestamps. The next section explains how to investigate the scope of a timestamp data quality issue.

Timestamp data quality issues can be difficult to notice if they're not actively causing performance strain. For best results, make sure you're familiar with the typical patterns and expected trends in your data (also referred to as "seasonal patterns"), so you can spot anomalies.

To check for date values far into the past or future, you can use the following options:

  • To review partially mounted searchable snapshots and their @timestamp date field only, use a cluster state request:

    GET _cluster/state?filter_path=metadata.indices.*.timestamp_range

    To filter and format the results, you can use third-party tools such as jq. For example, if you saved the response to a file named cluster_state.json, the following command lists the indices whose maximum timestamp is in the future:

    jq -cMr '.metadata.indices
      | to_entries
      | sort_by(.key)
      | .[]
      | .value.timestamp_range as $ts
      | select($ts.min)
      | {min:($ts.min/1000.0 | todate), max:($ts.max/1000.0 | todate), index:.key}' cluster_state.json \
      | jq -r --arg now "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" 'select(.max > $now)'

  • To list the top 200 aggregated indices by the number of documents whose timestamps are in the future, use a search request:

    GET */_search
    {
      "size": 0,
      "aggs": {
        "0": {
          "terms": {
            "field": "_index",
            "order": { "_count": "desc" },
            "size": 200
          }
        }
      },
      "query": {
        "bool": {
          "filter": [ { "range": { "@timestamp": { "gte": "now" } } } ]
        }
      }
    }

  • To list individual indices' minimum and maximum timestamps, use a search request.

    Warning

    This search can be resource-intensive, depending on your search target scope and the hardware profiles of nodes hosting related shards.

    GET my_datastream/_search
    {
      "size": 0,
      "aggs": {
        "2": {
          "terms": {
            "field": "_index",
            "order": { "_key": "asc" },
            "size": 200
          },
          "aggs": {
            "min": { "min": { "field": "@timestamp" } },
            "max": { "max": { "field": "@timestamp" } }
          }
        }
      }
    }
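
The response to the minimum and maximum timestamp request can be post-processed to flag only the indices that contain future dates. The following is a minimal sketch, assuming the response has been parsed into a Python dictionary and that the `min` and `max` aggregations report their values in epoch milliseconds. The function name `indices_with_future_max`, the sample index names, and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def indices_with_future_max(response, now_ms, agg_name="2"):
    """Return (index, max timestamp) pairs for buckets whose maximum is after now_ms."""
    flagged = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        max_ms = bucket["max"]["value"]
        # An index without @timestamp values reports null (None) for max.
        if max_ms is not None and max_ms > now_ms:
            as_text = datetime.fromtimestamp(max_ms / 1000, tz=timezone.utc).isoformat()
            flagged.append((bucket["key"], as_text))
    return flagged


# Hypothetical, abbreviated response used only to exercise the function.
NOW_MS = 1_717_243_200_000  # 2024-06-01T12:00:00Z
sample_response = {
    "aggregations": {
        "2": {
            "buckets": [
                {"key": "index-a", "doc_count": 10,
                 "min": {"value": NOW_MS - 86_400_000}, "max": {"value": NOW_MS - 1_000}},
                {"key": "index-b", "doc_count": 3,
                 "min": {"value": NOW_MS - 86_400_000}, "max": {"value": NOW_MS + 365 * 86_400_000}},
                {"key": "index-c", "doc_count": 0,
                 "min": {"value": None}, "max": {"value": None}},
            ]
        }
    }
}

for index, max_ts in indices_with_future_max(sample_response, NOW_MS):
    print(index, max_ts)
```

Passing the reference time explicitly, rather than reading the local clock inside the function, avoids depending on the clock of the machine that runs the check.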
    		

If you find future dates, check for patterns in the data distribution:

GET my_datastream/_search?filter_path=aggregations
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [ { "range": { "@timestamp": { "gte": "now" }}}]
    }
  },
  "aggs": {
    "time_buckets": {
      "auto_date_histogram": {
        "field": "@timestamp",
        "buckets": 30,
        "format": "yyyy-MM-dd"
      }
    }
  }
}
		

The rest of this page explains how to reduce the performance impact of these issues and clean up problematic data.

Even when timestamp data quality issues remain in your data, you can reduce their performance impact by adjusting how scheduled tasks and searches run.

Time series data is frequently searched by date field, and the most common date field is @timestamp. By default, this field's value reflects when the event originated, as reported by the source. This is the default date field when creating a data view. Discover and Dashboard objects use data views to resolve and search data.

For scheduled tasks that run without user interaction, consider searching on the event.ingested date field instead of @timestamp. By default, this field's value reflects when the event was ingested into the cluster. If event.ingested isn't already populated, refer to Troubleshoot ingest pipelines to add the field to your data with a custom ingest pipeline.
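
As a sketch of what searching on ingest time looks like for a scheduled job, the following helper builds a search body that filters on event.ingested instead of @timestamp. The function name `scheduled_search_body` and the five-minute default lookback are hypothetical. The body is a plain dictionary that you would send with whichever client you already use.

```python
def scheduled_search_body(lookback="5m", date_field="event.ingested"):
    """Build a search body restricted to documents ingested within the lookback window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Date math: from (now - lookback) up to now, evaluated by the cluster.
                    {"range": {date_field: {"gte": f"now-{lookback}", "lte": "now"}}}
                ]
            }
        }
    }


print(scheduled_search_body("10m"))
```

Because the range is evaluated against the time the cluster received each event, a source with a skewed clock cannot move its documents outside the window that the task inspects.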

These common scheduled tasks benefit from using event.ingested:

The event.ingested approach is recommended for Elastic Security data sources and is automatically used in Observability rules.

For Elastic Security detection rules, also consider enabling the advanced setting that ensures @timestamp is not used as a fallback for timestamp overrides.

You can use these Kibana advanced settings to exclude the data_cold and data_frozen data tiers from searches:

You can also use a Query DSL boolean query to filter out specific data tiers. Filtering with a query string query is insufficient.

For example, you can filter out data_cold and data_frozen with the following boolean query:

{
   "bool":{
      "must_not":{
         "terms":{
            "_tier":[ "data_cold", "data_frozen" ]
         }
      }
   }
}
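
If you build queries programmatically, the same exclusion can be wrapped around an existing query. The following is a minimal sketch. The helper name `exclude_tiers` is hypothetical, and it assumes the original query is a Query DSL clause held in a Python dictionary.

```python
def exclude_tiers(query, tiers=("data_cold", "data_frozen")):
    """Wrap a Query DSL clause so that documents on the given data tiers are excluded."""
    return {
        "bool": {
            "must": [query],  # keep the original query and its scoring
            "must_not": [{"terms": {"_tier": list(tiers)}}],
        }
    }


original = {"range": {"@timestamp": {"gte": "now-15m"}}}
print(exclude_tiers(original))
```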
		

After investigating timestamp data quality and reviewing best practices, clean up any issues by deleting or modifying the problematic data.

To remove invalid data, use one of these methods:

The following example steps modify invalid data by updating the @timestamp field to the current time:

  1. Create an ingest pipeline that sets the @timestamp date field to the current timestamp:

    PUT _ingest/pipeline/update_date
    {
      "processors": [
        {
          "rename": {
            "description": "(Optional) Cache the previous timestamp in a new field",
            "field": "@timestamp",
            "target_field": "old_timestamp"
          }
        },
        {
          "set": {
            "description": "Override the @timestamp value to the ingested time",
            "field": "_source.@timestamp",
            "value": "{{_ingest.timestamp}}"
          }
        }
      ]
    }
    		
  2. Run the data through the new pipeline to modify the value, using one of these approaches:

    • To modify the value across the entire index, reindex to a new index.

      POST _reindex
      {
        "source": {
          "index": ["my-index-000001", "my-index-000002"]
        },
        "dest": {
          "index": "my-new-index-000001",
          "pipeline": "update_date"
        }
      }
      		
    • To target specific documents, use an update by query request within the existing index.

      POST my_index/_update_by_query?pipeline=update_date
      {
        "query": {
          "range": {
            "@timestamp": {
              "gt": "now"
            }
          }
        }
      }
      		
Tip

To modify documents in a searchable snapshot index, you must first restore it to a regular index.
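
To reason about what the update_date pipeline from step 1 does to a document before you run it, the following sketch mimics its two processors locally. It is an approximation for illustration only: the function name `apply_update_date` is hypothetical, and the timestamp string it writes is not formatted exactly like the ingest timestamp that Elasticsearch produces. To test the real pipeline, use the simulate pipeline API against your cluster.

```python
import copy
from datetime import datetime, timezone


def apply_update_date(doc, ingest_time=None):
    """Mimic the rename and set processors of the update_date pipeline on one document."""
    ingest_time = ingest_time or datetime.now(timezone.utc)
    updated = copy.deepcopy(doc)  # leave the caller's document untouched
    # rename: move @timestamp to old_timestamp (raises KeyError if @timestamp is missing)
    updated["old_timestamp"] = updated.pop("@timestamp")
    # set: write the ingest time into @timestamp
    updated["@timestamp"] = ingest_time.isoformat()
    return updated


doc = {"@timestamp": "2099-01-01T00:00:00Z", "message": "example event"}
fixed_time = datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(apply_update_date(doc, fixed_time))
```

Keeping the previous value in old_timestamp preserves the evidence of the original data quality issue, which can help when you trace the problem back to its source.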