Reindex APIedit

The reindex API is new and should still be considered experimental. The API may change in ways that are not backwards compatible

The most basic form of _reindex just copies documents from one index to another. This will copy documents from the twitter index into the new_twitter index:

POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

That will return something like this:

{
  "took" : 639,
  "updated": 112,
  "batches": 130,
  "version_conflicts": 0,
  "failures" : [ ],
  "created": 12344
}

Just like _update_by_query, _reindex gets a snapshot of the source index but its target must be a different index so version conflicts are unlikely. The dest element can be configured like the index API to control optimistic concurrency control. Just leaving out version_type (as above) or setting it to internal will cause Elasticsearch to blindly dump documents into the target, overwriting any that happen to have the same type and id:

POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}

Setting version_type to external will cause Elasticsearch to preserve the version from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do in the source index:

POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}

Settings op_type to create will cause _reindex to only create missing documents in the target index. All existing documents will cause a version conflict:

POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}

By default version conflicts abort the _reindex process but you can just count them by settings "conflicts": "proceed" in the request body:

POST /_reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}

You can limit the documents by adding a type to the source or by adding a query. This will only copy tweet's made by kimchy into new_twitter:

POST /_reindex
{
  "source": {
    "index": "twitter",
    "type": "tweet",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

index and type in source can both be lists, allowing you to copy from lots of sources in one request. This will copy documents from the tweet and post types in the twitter and blog index. It’d include the post type in the twitter index and the tweet type in the blog index. If you want to be more specific you’ll need to use the query. It also makes no effort to handle ID collisions. The target index will remain valid but it’s not easy to predict which document will survive because the iteration order isn’t well defined.

POST /_reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["tweet", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}

It’s also possible to limit the number of processed documents by setting size. This will only copy a single document from twitter to new_twitter:

POST /_reindex
{
  "size": 1,
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

If you want a particular set of documents from the twitter index you’ll need to sort. Sorting makes the scroll less efficient but in some contexts it’s worth it. If possible, prefer a more selective query to size and sort. This will copy 10000 documents from twitter into new_twitter:

POST /_reindex
{
  "size": 10000,
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}

Like _update_by_query, _reindex supports a script that modifies the document. Unlike _update_by_query, the script is allowed to modify the document’s metadata. This example bumps the version of the source document:

POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "inline": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}"
  }
}

Think of the possibilities! Just be careful! With great power…​. You can change:

  • _id
  • _type
  • _index
  • _version
  • _routing
  • _parent
  • _timestamp
  • _ttl

Setting _version to null or clearing it from the ctx map is just like not sending the version in an indexing request. It will cause that document to be overwritten in the target index regardless of the version on the target or the version type you use in the _reindex request.

By default if _reindex sees a document with routing then the routing is preserved unless it’s changed by the script. You can set routing on the dest request to change this:

keep
Sets the routing on the bulk request sent for each match to the routing on the match. The default.
discard
Sets the routing on the bulk request sent for each match to null.
=<some text>
Sets the routing on the bulk request sent for each match to all text after the =.

For example, you can use the following request to copy all documents from the source index with the company name cat into the dest index with routing set to cat.

POST /_reindex
{
  "source": {
    "index": "source"
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}

By default _reindex uses scroll batches of 100. You can change the batch size with the size field in the source element:

POST /_reindex
{
  "source": {
    "index": "source",
    "size": 1500
  },
  "dest": {
    "index": "dest"
  }
}

URL Parametersedit

In addition to the standard parameters like pretty, the Reindex API also supports refresh, wait_for_completion, consistency, and timeout.

Sending the refresh url parameter will cause all indexes to which the request wrote to be refreshed. This is different than the Index API’s refresh parameter which causes just the shard that received the new data to be indexed.

If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. For now, once the request is finished the task is gone and the only place to look for the ultimate result of the task is in the Elasticsearch log file. This will be fixed soon.

consistency controls how many copies of a shard must respond to each write request. timeout controls how long each write request waits for unavailable shards to become available. Both work exactly how they work in the Bulk API.

Response bodyedit

The JSON response looks like this:

{
  "took" : 639,
  "updated": 0,
  "created": 123,
  "batches": 1,
  "version_conflicts": 2,
  "failures" : [ ]
}
took
The number of milliseconds from start to end of the whole operation.
updated
The number of documents that were successfully updated.
created
The number of documents that were successfully created.
batches
The number of scroll responses pulled back by the the reindex.
version_conflicts
The number of version conflicts that reindex hit.
failures
Array of all indexing failures. If this is non-empty then the request aborted because of those failures. See conflicts for how to prevent version conflicts from aborting the operation.

Works with the Task APIedit

While Reindex is running you can fetch their status using the Task API:

GET /_tasks/?pretty&detailed=true&actions=*reindex

The responses looks like:

{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "Tyrannus",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/reindex",
          "status" : {    
            "total" : 6154,
            "updated" : 3500,
            "created" : 0,
            "deleted" : 0,
            "batches" : 36,
            "version_conflicts" : 0,
            "noops" : 0
          },
          "description" : ""
        }
      }
    }
  }
}

this object contains the actual status. It is just like the response json with the important addition of the total field. total is the total number of operations that the reindex expects to perform. You can estimate the progress by adding the updated, created, and deleted fields. The request will finish when their sum is equal to the total field.

Reindex to change the name of a fieldedit

_reindex can be used to build a copy of an index with renamed fields. Say you create an index containing documents that look like this:

POST test/test/1?refresh&pretty
{
  "text": "words words",
  "flag": "foo"
}

But you don’t like the name flag and want to replace it with tag. _reindex can create the other index for you:

POST _reindex?pretty
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test2"
  },
  "script": {
    "inline": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}

Now you can get the new document:

GET test2/test/1?pretty

and it’ll look like:

{
  "text": "words words",
  "tag": "foo"
}

Or you can search by tag or whatever you want.