Control the phase transition timings in ILM using the origination date | Elastic Blog
Engineering

Control the phase transition timings in ILM using the origination date

As part of Elasticsearch 7.5.0, we introduced a couple of ways to control the index age math that index lifecycle management (ILM) uses for its phase timing calculations, via the origination_date index lifecycle settings. This means you can now tell Elasticsearch how old your data is, which is pretty handy if you’re indexing data that’s older than today-days-old.

If you’re new to ILM, let’s take a quick look at how index age ties into phase transition...

Index age and ILM phase transitions

In ILM, an index enters a phase based on the min_age parameter. The actions defined in a phase will not execute until the index becomes older than the configured min_age. If not configured, the min_age defaults to 0, meaning the index will transition from one phase to the next immediately after the actions in the phase are completed.

An index’s age is calculated based on the date the index was created. To illustrate this, let’s say we want to allow our ILM managed indexes to be written to for 7 days, then mark them as read-only. Overall, we’ll keep them for 30 days, at which point the data is not relevant to us anymore and we’d like to delete the index. We could define this policy as:

PUT /_ilm/policy/readonly_and_delete_policy 
{ 
  "policy": { 
    "phases": { 
      "warm": { 
        "min_age": "7d", 
        "actions": { 
          "readonly": {} 
        } 
      }, 
      "delete": { 
        "min_age": "30d", 
        "actions": { 
          "delete": {} 
        } 
      } 
    } 
  } 
}

In the above example, an index becomes eligible to enter the warm phase 7 days after it was created, at which point the read-only action will mark the index as read-only. The index will not be able to enter the delete phase until 30 days have elapsed since its creation date. Once the index turns 30 days old 🎂 it will enter the delete phase and be deleted.

This is great for automatically moving data to new phases, but what if data creation date != data index date?

What if there’s a gap between the index age and the data age?

Organizations work with data in different ways, but a few common patterns see 1) data transitioning from other systems of record into Elasticsearch at various points in the data’s lifecycle, or 2) data already in Elasticsearch being reindexed into new indexes. From an index age perspective, this is all new data (the indices were just created to accommodate these data migrations), but the data they hold is actually older (coming Back To The Future®) and might only be relevant for X amount of time since it was created, not since it reached Elasticsearch.

The new origination_date concept is meant to help close this data-index age gap by indicating an index is storing older data. This lets us calculate index age (in the context of ILM timings), based on the origination date, rather than the index creation date.

There are two new index settings that control the origination_date: index.lifecycle.parse_origination_date and index.lifecycle.origination_date. Let’s see them in action.

Using origination_date with ILM phases

index.lifecycle.parse_origination_date

Configuring the index.lifecycle.parse_origination_date to true will instruct ILM to parse the managed index name using the ^.*-{date_format}-\\d+ pattern and set the parsed date as the origination date for the index.

The supported date_format is yyyy.MM.dd, so events-2020.01.01-000001 is a valid index name which would have the 1st of January 2020 set as the origination date.

The -000001 trailing digits are optional as they are usually used in the context of policies that make use of a rollover action, so events-2020.01.01 is also a valid index name.
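One caveat worth noting: if parse_origination_date is enabled but the index name does not match this pattern, the index creation request is rejected. For example, creating an index under a name with no parseable date (my-events is a hypothetical name here) would fail:

PUT /my-events 
{ 
  "settings" : { 
    "index" : { 
      "lifecycle.name": "readonly_and_delete_policy", 
      "lifecycle.parse_origination_date": true 
    } 
  } 
}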

Let’s say that on January 3, 2020, we create an index and attach the above readonly_and_delete_policy (which marks the index as read-only after 7 days and deletes it after 30 days).

We start ingesting live/real-time data, but also need to ingest a batch of events that originated between the 1st and 3rd of January from another system. The entire dataset needs to go through the “mark as read-only on January 8 and delete on January 31st” lifecycle.

In order to achieve this, and letting ILM parse the origination date for us, the index would have to be named events-2020.01.01. Let’s create it:

PUT /events-2020.01.01 
{ 
  "settings" : { 
    "index" : { 
      "lifecycle.name": "readonly_and_delete_policy", 
      "lifecycle.parse_origination_date": true # <1> 
    } 
  } 
}

<1> Enable the parsing of the origination date from the index name.

Let’s check the index’s lifecycle status using the explain API:

GET /events-2020.01.01/_ilm/explain 
> { 
    "indices" : { 
      "events-2020.01.01" : { 
        "index" : "events-2020.01.01", 
        "managed" : true, 
        "policy" : "readonly_and_delete_policy", 
        "lifecycle_date_millis" : 1577836800000, # <1> 
        "age" : "2.48d", # <2> 
        "phase" : "new", # <3> 
        "phase_time_millis" : 1578049353000, 
        "action" : "complete", 
        "action_time_millis" : 1578049353000, 
        "step" : "complete", 
        "step_time_millis" : 1578049353000 
      } 
    } 
  }

<1> The lifecycle_date_millis was correctly parsed to 1577836800000 (Wednesday, 1st of January 2020, 00:00:00).

<2> The index age of 2+d is calculated based on the origination date parsed from the index name and the current time, on Friday 3rd of January, when the policy was applied to the index.

<3> The index is in the new phase as the 7 days min_age condition for the warm phase has not been met yet.

index.lifecycle.origination_date

While the origination date can be parsed from the index name, it can also be set explicitly, either because the index name doesn’t conform to the expected pattern or because you want to control the age directly.

Coming back to our use case, we configure the index.lifecycle.origination_date for the events index to 1577836800000 (namely Wednesday, 1st of January 2020 00:00:00) <2> when creating the index (see below). 

PUT /events 
{ 
  "settings" : { 
    "index" : { 
      "lifecycle.name": "readonly_and_delete_policy", # <1> 
      "lifecycle.origination_date": 1577836800000 # <2> 
    } 
  } 
}

<1> Attach the readonly_and_delete_policy to the index.

<2> Set the origination date to 1577836800000 (Wednesday, 1st of January 2020, 00:00:00).

Let’s check the index’s current lifecycle status using the explain API:

GET /events/_ilm/explain 
> { 
    "indices" : { 
      "events" : { 
        "index" : "events", 
        "managed" : true, 
        "policy" : "readonly_and_delete_policy", 
        "lifecycle_date_millis" : 1577836800000, # <1> 
        "age" : "2.46d", # <2> 
        "phase" : "new", # <3> 
        "phase_time_millis" : 1578049353000, 
        "action" : "complete", 
        "action_time_millis" : 1578049353000, 
        "step" : "complete", 
        "step_time_millis" : 1578049353000 
      } 
    } 
  }

<1> We can see the lifecycle_date_millis timestamp, used to calculate the min_age, set to the value we configured using the index.lifecycle.origination_date setting.

<2> Given that we created this index on the 3rd of January, the age is 2+ days, as the origination date we configured is the 1st of January.

<3> The lifecycle is still in the new phase as the min_age configured for the warm phase is 7 days and the current index age is 2.46 days.

Both the index.lifecycle.origination_date and index.lifecycle.parse_origination_date settings are dynamic and can be updated at any point using the update index settings API.
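For instance (a sketch using the update index settings API against the events index from above), the origination date could be set or corrected retroactively on an existing index:

PUT /events/_settings 
{ 
  "index.lifecycle.origination_date": 1577836800000 
}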

Note: All phase transition timings are based on the origination date if it is configured (even if the ILM policy uses the rollover action).
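As an illustration, date parsing can be enabled for every index in a series via an index template (a sketch using the legacy index template API available in 7.5; the events_template name and events-* pattern are hypothetical):

PUT /_template/events_template 
{ 
  "index_patterns": ["events-*"], 
  "settings": { 
    "index.lifecycle.name": "readonly_and_delete_policy", 
    "index.lifecycle.parse_origination_date": true 
  } 
}

With this in place, each newly created index whose name matches events-* picks up the policy and parses its own origination date from its name.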

Ta-da!

In this post, we brushed up on the ILM phase timing calculations and looked at how to bridge the gap between the data age and the index age using the index.lifecycle.origination_date and index.lifecycle.parse_origination_date index settings. We instructed ILM to take a different, manually specified age into account for its phase timings, rather than the index creation age, so that our full dataset, both old and new, can go through the same lifecycle.

If you haven’t used ILM yet, check out our blog post on using ILM to implement a hot-warm-cold architecture, which is a common way to manage time-series data. And if you want to give ILM a try, spin up a free trial of Elasticsearch Service (choose the Hot-Warm template to make it even easier).