Reindex is coming!

_reindex and _update_by_query are coming to Elasticsearch 2.3.0 and 5.0.0-alpha1! Hurray!

_reindex reads documents from one index and writes them to another. It can be used to copy documents into a different index, enrich them with new fields, or recreate an index to change settings that are locked when the index is created.

_update_by_query reads documents from an index and writes them back to the same index. It can be used to update fields in many documents at once or to pick up mapping changes that can be made online.

_reindex copies documents

The _reindex API is really just a convenient way to copy documents from one index to another. Everything else it can do is an outgrowth of that. If all you want is to copy every document from the src index into the dest index, you invoke _reindex like this:

curl -XPOST localhost:9200/_reindex?pretty -d'{
  "source": {
    "index": "src"
  },
  "dest": {
    "index": "dest"
  }
}'
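
When it finishes, _reindex responds with a summary of what it did. The exact fields vary a little between versions and the numbers here are purely illustrative, but the response looks something like this:

{
  "took" : 639,
  "created" : 1000,
  "updated" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : 0,
  "throttled_millis" : 0,
  "failures" : [ ]
}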

If you want to be a little more selective and, say, only copy documents tagged with bananas, you invoke _reindex like this:

curl -XPOST localhost:9200/_reindex?pretty -d'{
  "source": {
    "index": "src",
    "query": {
      "match": {
        "tags": "bananas"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}'

If you want to copy documents tagged with bananas and add the chocolate tag to each copied document, you invoke _reindex like this:

curl -XPOST localhost:9200/_reindex?pretty -d'{
  "source": {
    "index": "src",
    "query": {
      "match": {
        "tags": "bananas"
      }
    }
  },
  "dest": {
    "index": "dest"
  },
  "script": {
    "inline": "ctx._source.tags += \"chocolate\""
  }
}'

That requires that you have dynamic scripts enabled, but you can do the same thing with non-inline scripts.
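
For example, the same request with a file script might look like this (a sketch: it assumes you have saved the one-line script as add-chocolate.groovy in the config/scripts directory of every node):

# config/scripts/add-chocolate.groovy contains the same one line:
#   ctx._source.tags += "chocolate"
curl -XPOST localhost:9200/_reindex?pretty -d'{
  "source": {
    "index": "src",
    "query": {
      "match": {
        "tags": "bananas"
      }
    }
  },
  "dest": {
    "index": "dest"
  },
  "script": {
    "file": "add-chocolate"
  }
}'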

Recreating an index to change settings that are locked at index creation is a bit more involved, but still simpler than it was before _reindex:

# Say you have an old index that you made like this
curl -XPUT localhost:9200/test_1 -d'{
  "aliases": {
    "test": {}
  }
}'
for i in $(seq 1 1000); do
  curl -XPOST localhost:9200/test/test -d'{"tags": ["bananas"]}'
  echo
done
curl -XPOST localhost:9200/test/_refresh?pretty
# But you don't like having the default number of shards
# You can make a copy of it with the new number of shards
curl -XPUT localhost:9200/test_2 -d'{
  "settings": {
    "number_of_shards": 1
  }
}'
curl -XPOST 'localhost:9200/_reindex?pretty&refresh' -d'{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_2"
  }
}'
# Then just swing the alias to the new index
curl -XPOST localhost:9200/_aliases?pretty -d'{
  "actions": [
    { "remove": { "index": "test_1", "alias": "index" } },
    { "add": { "index": "test_2", "alias": "index" } }
  ]
}'
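# (A sanity check that isn't strictly necessary: before deleting the
# old index, confirm that both indices report the same document count)
curl localhost:9200/test_1/_count?pretty
curl localhost:9200/test_2/_count?pretty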
# Then when you are good and sure you are done with it you can
curl -XDELETE localhost:9200/test_1?pretty

_update_by_query modifies documents

The simplest way to invoke update by query isn't particularly useful on its own:

curl -XPOST localhost:9200/test/_update_by_query?pretty

That will just increment the document version number on each document in the test index and fail if you modify a document while it is running. A more interesting example is adding the chocolate tag to all documents with the bananas tag:

curl -XPOST 'localhost:9200/test/_update_by_query?pretty&refresh' -d'{
  "query": {
    "bool": {
      "must": [ {"match": {"tags": "bananas"}} ],
      "must_not": [ {"match": {"tags": "chocolate"}} ]
    }
  },
  "script": {
    "inline": "ctx._source.tags += \"chocolate\""
  }
}'

Like the last example, this will fail if any documents change while it is running, but it is written so that you can simply retry it and it will pick up where it left off. If whatever application is making the concurrent updates has already been modified to add the chocolate tag whenever it sees bananas, then you can safely ignore version conflicts in the _update_by_query. You tell it to do so by setting conflicts=proceed; it will just count the version conflicts and continue performing updates. Now the command looks like this:

curl -XPOST 'localhost:9200/test/_update_by_query?pretty&refresh&conflicts=proceed' -d'{
  "query": {
    "bool": {
      "must": [ {"match": {"tags": "bananas"}} ],
      "must_not": [ {"match": {"tags": "chocolate"}} ]
    }
  },
  "script": {
    "inline": "ctx._source.tags += \"chocolate\""
  }
}'

Finally, you can use _update_by_query to pick up mapping changes that only take effect when a document is reindexed, such as adding a new sub-field to an existing field. For example:

# Say I made an index with tags not_analyzed because, you know, they are tags after all
curl -XPUT localhost:9200/test_3?pretty -d'{
  "mappings": {
    "test": {
      "properties": {
        "tags": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'
for i in $(seq 1 1000); do
  curl -XPOST localhost:9200/test_3/test -d'{"tags": ["bananas"]}'
  echo
done
curl -XPOST localhost:9200/test_3/_refresh?pretty
# But now I also want to run analyzed, full text searches on tags
curl -XPUT localhost:9200/test_3/_mapping/test?pretty -d'{
  "properties": {
    "tags": {
      "type": "string",
      "index": "not_analyzed",
      "fields": {
        "analyzed": {
          "type": "string",
          "analyzer": "standard"
        }
      }
    }
  }
}'
# The new sub-field doesn't take effect on existing documents immediately
curl 'localhost:9200/test_3/_search?pretty' -d'{
  "query": {
    "match": {
      "tags.analyzed": "bananas"
    }
  }
}'
# :(
# But we can _update_by_query to pick up the new mapping on all documents
curl -XPOST 'localhost:9200/test_3/_update_by_query?pretty&conflicts=proceed&refresh'
# And now the new mapping has been applied to the whole index!
curl 'localhost:9200/test_3/_search?pretty' -d'{
  "query": {
    "match": {
      "tags.analyzed": "bananas"
    }
  }
}'

Getting the status

_reindex and _update_by_query can touch millions of documents so they can take a long time. You can fetch their status with:

curl 'localhost:9200/_tasks?pretty&detailed&actions=*reindex,*byquery'

The response will contain an entry for each matching task that looks like this:

"BHgHr0cETkOehwqZ2N_-aQ:28295" : {
  "node" : "BHgHr0cETkOehwqZ2N_-aQ",
  "id" : 28295,
  "type" : "transport",
  "action" : "indices:data/write/reindex",
  "start_time_in_millis" : 1458767149108,
  "running_time_in_nanos" : 5475314,
  "status" : {
    "total" : 6154,
    "updated" : 3500,
    "created" : 0,
    "deleted" : 0,
    "batches" : 36,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries": 0,
    "throttled_millis": 0
  }
}

You can read the docs for more, but the gist is that _reindex plans to do total operations and has already done updated + created + deleted + noops of them. So you can estimate how complete the request is by dividing those numbers.
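
For the example status above that works out to (3500 + 0 + 0 + 0) / 6154 ≈ 0.57, so the request is roughly 57% complete.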

Cancelling

_reindex was so long in coming because Elasticsearch lacked a way to cancel running tasks. For short-running tasks like _search and indexing that is fine. But, like I wrote above, _reindex and _update_by_query can touch millions of documents and take a long time. The tasks themselves are OK with that, but you may not be. Say you realize ten minutes into a three-hour _update_by_query that you made a mistake in the script. There isn't a way to roll back the changes that the request has already made, but you can cancel it so it won't make any more such changes:

curl -XPOST localhost:9200/_tasks/{taskId}/_cancel

And where do you get the taskId? It is the key of the object returned by the task listing API in the previous section. The one in the example response is BHgHr0cETkOehwqZ2N_-aQ:28295.
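
So cancelling the example task from above looks like this:

curl -XPOST localhost:9200/_tasks/BHgHr0cETkOehwqZ2N_-aQ:28295/_cancel?pretty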

In Elasticsearch, task cancellation is opt-in. It kind of has to be that way in any Java application. Tasks that can be cancelled, like _reindex and _update_by_query, periodically check whether they have been cancelled and then shut themselves down. This means that you might still see the task if you list its status immediately after cancelling it. It will go away on its own, and you can't cancel it any harder without stopping the node it is running on.

Remember that Elasticsearch is a search engine

Every update has to mark the old document as deleted and index an entire new document, and the deleted documents then have to be merged out of the index. _reindex and _update_by_query don't save anything in that process: they work just as though you had performed a scroll query and indexed all the results yourself. Running a zillion _reindex or _update_by_query requests is unlikely to be the most efficient use of computing resources to accomplish some task. You will almost always be better off making changes in the application that adds data to Elasticsearch rather than updating the data after the fact. _reindex and _update_by_query are most useful for turning the data that you already have in Elasticsearch into the data that you want to be in Elasticsearch.