WARNING: Version 2.4 of Elasticsearch has passed its EOL date.
Reindex API
The reindex API is new and should still be considered experimental. The API may change in ways that are not backwards compatible.
Reindex does not attempt to set up the destination index. It does
not copy the settings of the source index. You should set up the destination
index prior to running a _reindex action, including setting up mappings, shard
counts, replicas, etc.
The most basic form of _reindex just copies documents from one index to another.
This will copy documents from the twitter index into the new_twitter index:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
That will return something like this:
{
"took" : 147,
"timed_out": false,
"created": 120,
"updated": 0,
"batches": 1,
"version_conflicts": 0,
"failures" : [ ]
}
Just like _update_by_query, _reindex gets a
snapshot of the source index, but its target must be a different index so
version conflicts are unlikely. The dest element can be configured like the
index API to control optimistic concurrency. Just leaving out
version_type (as above) or setting it to internal will cause Elasticsearch
to blindly dump documents into the target, overwriting any that happen to have
the same type and id:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"version_type": "internal"
}
}
Setting version_type to external will cause Elasticsearch to preserve the
version from the source, create any documents that are missing, and update
any documents that have an older version in the destination index than they do
in the source index:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"version_type": "external"
}
}
Setting op_type to create will cause _reindex to only create missing
documents in the target index. All existing documents will cause a version
conflict:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"op_type": "create"
}
}
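The combined effect of version_type and op_type on a single document can be sketched as a small decision function. This is a simplified Python model of the behaviour described above, not Elasticsearch code: the function name reindex_write_outcome and its return strings are hypothetical, and it assumes that external versioning only accepts a source version strictly greater than the one already in the destination.

```python
def reindex_write_outcome(dest_version, source_version,
                          version_type="internal", op_type="index"):
    """Model what happens to one document copied by _reindex.

    dest_version is None when the document is missing from the target.
    Returns 'created', 'updated', or 'version_conflict'.
    Hypothetical helper for illustration only.
    """
    if dest_version is None:
        # Missing documents are always created.
        return "created"
    if op_type == "create":
        # op_type=create refuses to touch documents that already exist.
        return "version_conflict"
    if version_type == "external":
        # Assumption: only a strictly newer source version wins.
        if source_version > dest_version:
            return "updated"
        return "version_conflict"
    # internal (or omitted) version_type overwrites blindly.
    return "updated"
```

For example, a document at version 5 in the destination and version 3 in the source is overwritten with the default settings, but counts as a version conflict with version_type set to external.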
By default, version conflicts abort the _reindex process, but you can just
count them instead by setting "conflicts": "proceed" in the request body:
POST /_reindex
{
"conflicts": "proceed",
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"op_type": "create"
}
}
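If you assemble these requests from a script, the request bodies shown so far can be built with a small helper. The following is a minimal Python sketch, not part of any official client: the function name build_reindex_body is hypothetical, and sending the result to POST /_reindex is left to whichever HTTP library you use.

```python
import json


def build_reindex_body(source_index, dest_index,
                       version_type=None, op_type=None, conflicts=None):
    """Build the JSON body for POST /_reindex (hypothetical helper)."""
    dest = {"index": dest_index}
    if version_type is not None:
        dest["version_type"] = version_type  # "internal" or "external"
    if op_type is not None:
        dest["op_type"] = op_type            # e.g. "create"
    body = {"source": {"index": source_index}, "dest": dest}
    if conflicts is not None:
        body["conflicts"] = conflicts        # e.g. "proceed"
    return body


if __name__ == "__main__":
    # Mirrors the request above: create only, count conflicts instead of aborting.
    body = build_reindex_body("twitter", "new_twitter",
                              op_type="create", conflicts="proceed")
    print(json.dumps(body, indent=2))
```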
You can limit the documents by adding a type to the source or by adding a
query. This will only copy tweets made by kimchy into new_twitter:
POST /_reindex
{
"source": {
"index": "twitter",
"type": "tweet",
"query": {
"term": {
"user": "kimchy"
}
}
},
"dest": {
"index": "new_twitter"
}
}
index and type in source can both be lists, allowing you to copy from
many sources in one request. This will copy documents from the tweet and
post types in the twitter and blog indices. Note that this includes the post type in
the twitter index and the tweet type in the blog index. If you want to be
more specific you’ll need to use the query. It also makes no effort to handle
ID collisions. The target index will remain valid, but it’s not easy to predict
which document will survive because the iteration order isn’t well defined.
POST /_reindex
{
"source": {
"index": ["twitter", "blog"],
"type": ["tweet", "post"]
},
"dest": {
"index": "all_together"
}
}
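To see why the request above also picks up the post type in twitter and the tweet type in blog, it helps to think of the two lists as a cross product. The Python sketch below only enumerates the combinations as described in the text; the function name searched_pairs is hypothetical and nothing here talks to Elasticsearch.

```python
from itertools import product


def searched_pairs(indices, types):
    """Return every (index, type) combination the source lists cover."""
    return list(product(indices, types))


pairs = searched_pairs(["twitter", "blog"], ["tweet", "post"])
for index, doc_type in pairs:
    print(index + "/" + doc_type)
```

All four combinations are listed, including twitter/post and blog/tweet, which is why a query is needed if you only want some of them.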
It’s also possible to limit the number of processed documents by setting
size. This will only copy a single document from twitter to
new_twitter:
POST /_reindex
{
"size": 1,
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
If you want a particular set of documents from the twitter index you’ll
need to sort. Sorting makes the scroll less efficient but in some contexts
it’s worth it. If possible, prefer a more selective query to size and sort.
This will copy 10000 documents from twitter into new_twitter:
POST /_reindex
{
"size": 10000,
"source": {
"index": "twitter",
"sort": { "date": "desc" }
},
"dest": {
"index": "new_twitter"
}
}
The source section supports all the elements that are supported in a
search request. For instance only a subset of the
fields from the original documents can be reindexed using source filtering
as follows:
POST _reindex
{
"source": {
"index": "twitter",
"_source": ["user", "tweet"]
},
"dest": {
"index": "new_twitter"
}
}
Like _update_by_query, _reindex supports a script that modifies the
document. Unlike _update_by_query, the script is allowed to modify the
document’s metadata. This example bumps the version of the source document:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"version_type": "external"
},
"script": {
"inline": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}"
}
}
Think of the possibilities! Just be careful! With great power… You can change:
- _id
- _type
- _index
- _version
- _routing
- _parent
- _timestamp
- _ttl
Setting _version to null or clearing it from the ctx map is just like not
sending the version in an indexing request. It will cause that document to be
overwritten in the target index regardless of the version on the target or the
version type you use in the _reindex request.
By default if _reindex sees a document with routing then the routing is
preserved unless it’s changed by the script. You can set routing on the
dest request to change this:
- keep: Sets the routing on the bulk request sent for each match to the routing on the match. The default.
- discard: Sets the routing on the bulk request sent for each match to null.
- =<some text>: Sets the routing on the bulk request sent for each match to all text after the =.
For example, you can use the following request to copy all documents from
the source index with the company name cat into the dest index with
routing set to cat.
POST /_reindex
{
"source": {
"index": "source",
"query": {
"match": {
"company": "cat"
}
}
},
"dest": {
"index": "dest",
"routing": "=cat"
}
}
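The three routing values can be read as a small lookup rule applied to each matched document. The following Python sketch models that rule as described above; the function name resolve_routing is hypothetical, and the error raised for any other value is an assumption of this sketch rather than documented behaviour.

```python
def resolve_routing(dest_routing, match_routing):
    """Return the routing used on the bulk request for one matched document.

    dest_routing is the value of "routing" in the dest element;
    match_routing is the routing already on the matched document (or None).
    """
    if dest_routing == "keep":
        return match_routing        # the default: preserve existing routing
    if dest_routing == "discard":
        return None                 # drop routing entirely
    if dest_routing.startswith("="):
        return dest_routing[1:]     # everything after the '='
    # Assumption: anything else is treated as invalid in this sketch.
    raise ValueError("unsupported routing value: " + repr(dest_routing))
```

With "routing": "=cat" as in the request above, every copied document is routed with cat regardless of its original routing.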
By default _reindex uses scroll batches of 1000. You can change the
batch size with the size field in the source element:
POST _reindex
{
"source": {
"index": "source",
"size": 100
},
"dest": {
"index": "dest"
}
}
URL Parameters
In addition to the standard parameters like pretty, the Reindex API also
supports refresh, wait_for_completion, consistency, timeout, and
requests_per_second.
Sending the refresh URL parameter will cause all indexes to which the request
wrote to be refreshed. This is different from the Index API’s refresh
parameter, which causes just the shard that received the new data to be refreshed.
If the request contains wait_for_completion=false then Elasticsearch will
perform some preflight checks, launch the request, and then return a task
which can be used with Tasks APIs to cancel or get
the status of the task. For now, once the request is finished the task is gone
and the only place to look for the ultimate result of the task is in the
Elasticsearch log file. This will be fixed soon.
consistency controls how many copies of a shard must respond to each write
request. timeout controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
Bulk API.
requests_per_second can be set to any decimal number (1.4, 6, 1000, etc.)
and throttles the number of requests per second that the reindex issues. The
throttling is done by waiting between bulk batches so that it can manipulate the
scroll timeout. The wait time is the difference between the target time for the
batch, requests_in_the_batch / requests_per_second, and the time it took the
batch to complete.
Since the batch isn’t broken into multiple bulk requests, large batch sizes will
cause Elasticsearch to create many requests and then wait for a while before
starting the next set. This is "bursty" instead of "smooth". The default is
unlimited, which is also the only non-number value that it accepts.
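The throttling arithmetic can be made concrete with a short calculation. This Python sketch is an illustration under stated assumptions, not the Elasticsearch implementation: it assumes the target time for a batch is the number of requests in the batch divided by requests_per_second, that the wait is never negative, and the function name throttle_wait_seconds is hypothetical.

```python
def throttle_wait_seconds(requests_in_batch, requests_per_second, batch_seconds):
    """Estimate how long reindex sleeps after one bulk batch.

    requests_per_second may be a number or the string "unlimited".
    batch_seconds is how long the batch actually took to complete.
    """
    if requests_per_second == "unlimited":
        return 0.0
    # Assumed target: time the batch *should* take at the requested rate.
    target_seconds = requests_in_batch / float(requests_per_second)
    # Sleep only for whatever part of the target has not already elapsed.
    return max(0.0, target_seconds - batch_seconds)


print(throttle_wait_seconds(1000, 500, 0.5))  # 1.5
```

A batch of 1000 requests throttled to 500 requests per second has a target of 2 seconds, so a batch that finished in 0.5 seconds is followed by a 1.5 second pause. That pause after a burst of writes is the "bursty" behaviour described above.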
Response body
The JSON response looks like this:
{
"took" : 639,
"updated": 0,
"created": 123,
"batches": 1,
"version_conflicts": 2,
"retries": 0,
"throttled_millis": 0,
"failures" : [ ]
}
- took: The number of milliseconds from start to end of the whole operation.
- updated: The number of documents that were successfully updated.
- created: The number of documents that were successfully created.
- batches: The number of scroll responses pulled back by the reindex.
- version_conflicts: The number of version conflicts that the reindex hit.
- retries: The number of retries that the reindex did in response to a full queue.
- throttled_millis: The number of milliseconds the request slept to conform to requests_per_second.
- failures: Array of all indexing failures. If this is non-empty then the request aborted because of those failures. See conflicts for how to prevent version conflicts from aborting the operation.
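A client will usually want to reduce this response to a yes/no answer plus a few counters. The Python sketch below shows one way to do that using only the fields listed above; the function name summarize_reindex and the shape of its return value are hypothetical.

```python
def summarize_reindex(response):
    """Condense a _reindex response into the numbers most callers check."""
    failures = response.get("failures", [])
    return {
        # A non-empty failures array means the request aborted.
        "ok": not failures,
        "written": response.get("created", 0) + response.get("updated", 0),
        "version_conflicts": response.get("version_conflicts", 0),
    }


sample = {
    "took": 639, "updated": 0, "created": 123, "batches": 1,
    "version_conflicts": 2, "retries": 0, "throttled_millis": 0,
    "failures": [],
}
print(summarize_reindex(sample))
```

For the sample response above this reports 123 documents written and 2 version conflicts, with no aborting failures.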
Works with the Task API
While a reindex is running you can fetch its status using the Task API:
GET /_tasks/?pretty&detailed=true&actions=*reindex
The response looks like:
{
"nodes" : {
"r1A2WoRbTwKZ516z6NEs5A" : {
"name" : "Tyrannus",
"transport_address" : "127.0.0.1:9300",
"host" : "127.0.0.1",
"ip" : "127.0.0.1:9300",
"attributes" : {
"testattr" : "test",
"portsfile" : "true"
},
"tasks" : {
"r1A2WoRbTwKZ516z6NEs5A:36619" : {
"node" : "r1A2WoRbTwKZ516z6NEs5A",
"id" : 36619,
"type" : "transport",
"action" : "indices:data/write/reindex",
"status" : {
"total" : 6154,
"updated" : 3500,
"created" : 0,
"deleted" : 0,
"batches" : 4,
"version_conflicts" : 0,
"noops" : 0,
"retries": 0,
"throttled_millis": 0
},
"description" : ""
}
}
}
}
}
The status object contains the actual status. It is just like the response JSON
with the important addition of the …
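The status object in the sample above carries a total alongside the updated, created, and deleted counters, which suggests a rough progress estimate. The Python sketch below assumes that progress is the sum of those three counters divided by total; that interpretation and the function name reindex_progress are assumptions of this example.

```python
def reindex_progress(status):
    """Estimate the fraction of a reindex that has completed (0.0 to 1.0)."""
    total = status.get("total", 0)
    if total <= 0:
        return 0.0
    # Assumption: every processed document lands in one of these counters.
    done = (status.get("updated", 0)
            + status.get("created", 0)
            + status.get("deleted", 0))
    return min(1.0, done / float(total))


status = {"total": 6154, "updated": 3500, "created": 0, "deleted": 0,
          "batches": 4, "version_conflicts": 0, "noops": 0}
print("{:.0%}".format(reindex_progress(status)))
```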
Works with the Cancel Task API
Any reindex can be canceled using the Task Cancel API:
POST /_tasks/{task_id}/_cancel
The task_id can be found using the tasks API above.
Cancellation should happen quickly but might take a few seconds. The task status API above will continue to list the task until it wakes to cancel itself.
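Finding the task_id by hand gets tedious, so it can be pulled out of the tasks response programmatically. The Python sketch below assumes the task_id is the key of each entry in the tasks map, as in the sample response above (node id, a colon, then the numeric id); the function names reindex_task_ids and cancel_path are hypothetical.

```python
def reindex_task_ids(tasks_response):
    """Collect the ids of running reindex tasks from a tasks API response."""
    ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            # Keep only tasks whose action is the reindex action.
            if task.get("action", "").endswith("reindex"):
                ids.append(task_id)
    return sorted(ids)


def cancel_path(task_id):
    """Build the path for POST /_tasks/{task_id}/_cancel."""
    return "/_tasks/{}/_cancel".format(task_id)
```

Each returned path can then be sent as a POST request with the HTTP client of your choice.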
Rethrottling
The value of requests_per_second can be changed on a running reindex using
the _rethrottle API:
POST /_reindex/{task_id}/_rethrottle?requests_per_second=unlimited
The task_id can be found using the tasks API above.
Just like when setting it on the _reindex API, requests_per_second can be
either unlimited to disable throttling or any decimal number like 1.7 or
12 to throttle to that level. Rethrottling that speeds up the query takes
effect immediately, but rethrottling that slows down the query will take effect
after completing the current batch. This prevents scroll timeouts.
Reindex to change the name of a field
_reindex can be used to build a copy of an index with renamed fields. Say you
create an index containing documents that look like this:
POST test/test/1?refresh&pretty
{
"text": "words words",
"flag": "foo"
}
But you don’t like the name flag and want to replace it with tag.
_reindex can create the other index for you:
POST _reindex?pretty
{
"source": {
"index": "test"
},
"dest": {
"index": "test2"
},
"script": {
"inline": "ctx._source.tag = ctx._source.remove(\"flag\")"
}
}
Now you can get the new document:
GET test2/test/1?pretty
and it’ll look like:
{
"text": "words words",
"tag": "foo"
}
Or you can search by tag or whatever you want.
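To check the rename logic before running it against real data, the same transformation can be tried on a plain dictionary. This Python sketch mirrors what the inline script above does to each document's _source; it runs entirely outside Elasticsearch, and the function name rename_field is hypothetical.

```python
def rename_field(source, old, new):
    """Return a copy of a document with field `old` renamed to `new`.

    Mirrors ctx._source.tag = ctx._source.remove("flag"): the old field is
    removed and its value (None if it was absent) is stored under the new name.
    """
    doc = dict(source)
    doc[new] = doc.pop(old, None)
    return doc


print(rename_field({"text": "words words", "flag": "foo"}, "flag", "tag"))
```

Note that, like the script, this sketch sets the new field to a null value when the old field is missing, so you may want a guard if not every document has the field.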