Resolve lifecycle policy execution errors

When ILM executes a lifecycle policy, it’s possible for errors to occur while performing the necessary index operations for a step. When this happens, ILM moves the index to an ERROR step. If {ilm-init] cannot resolve the error automatically, execution is halted until you resolve the underlying issues with the policy, index, or cluster.

For example, you might have a shrink-index policy that shrinks an index to four shards once it is at least five days old:

PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 4
          }
        }
      }
    }
  }
}

There is nothing that prevents you from applying the shrink-index policy to a new index that has only two shards:

PUT /myindex
{
  "settings": {
    "index.number_of_shards": 2,
    "index.lifecycle.name": "shrink-index"
  }
}

After five days, ILM attempts to shrink myindex from two shards to four shards. Because the shrink action cannot increase the number of shards, this operation fails and ILM moves myindex to the ERROR step.

You can use the ILM Explain API to get information about what went wrong:

GET /myindex/_ilm/explain

Which returns the following information:

{
  "indices" : {
    "myindex" : {
      "index" : "myindex",
      "managed" : true,
      "policy" : "shrink-index",                
      "lifecycle_date_millis" : 1541717265865,
      "age": "5.1d",                            
      "phase" : "warm",                         
      "phase_time_millis" : 1541717272601,
      "action" : "shrink",                      
      "action_time_millis" : 1541717272601,
      "step" : "ERROR",                         
      "step_time_millis" : 1541717272688,
      "failed_step" : "shrink",                 
      "step_info" : {
        "type" : "illegal_argument_exception",  
        "reason" : "the number of target shards [4] must be less that the number of source shards [2]"
      },
      "phase_execution" : {
        "policy" : "shrink-index",
        "phase_definition" : {                  
          "min_age" : "5d",
          "actions" : {
            "shrink" : {
              "number_of_shards" : 4
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1541717264230
      }
    }
  }
}

The policy being used to manage the index: shrink-index

The index age: 5.1 days

The phase the index is currently in: warm

The current action: shrink

The step the index is currently in: ERROR

The step that failed to execute: shrink

The type of error and a description of that error.

The definition of the current phase from the shrink-index policy

To resolve this, you could update the policy to shrink the index to a single shard after 5 days:

PUT _ilm/policy/shrink-index
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "5d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}

Retrying failed lifecycle policy steps

Once you fix the problem that put an index in the ERROR step, you might need to explicitly tell ILM to retry the step:

POST /myindex/_ilm/retry

ILM subsequently attempts to re-run the step that failed. You can use the ILM Explain API to monitor the progress.