Scroll

A scroll query is used to retrieve large numbers of documents from Elasticsearch efficiently, without paying the penalty of deep pagination.

Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left. It’s a bit like a cursor in a traditional database.

A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.

The costly part of deep pagination is the global sorting of results, but if we disable sorting, then we can return all documents quite cheaply. To do this, we sort by _doc. This instructs Elasticsearch to just return the next batch of results from every shard that still has results to return.

To scroll through results, we execute a search request and set the scroll value to the length of time we want to keep the scroll window open. The scroll expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents that match the query. The timeout is important because keeping the scroll window open consumes resources, and we want to free them as soon as they are no longer needed. Setting the timeout enables Elasticsearch to automatically free the resources after a short period of inactivity.
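The refresh behavior described above can be illustrated with a toy model. This is only a sketch of the idea, not how Elasticsearch implements scroll windows: the ScrollWindow class, the run_batches function, and the simulated clock are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ScrollWindow:
    """Toy model of a scroll window's expiry (not Elasticsearch's implementation)."""
    expires_at: float  # simulated clock time, in seconds, at which the window closes

    @classmethod
    def open(cls, now: float, keep_alive: float) -> "ScrollWindow":
        # The initial search opens the window for keep_alive seconds.
        return cls(expires_at=now + keep_alive)

    def request(self, now: float, keep_alive: float) -> None:
        # A scroll request fails if the window has already expired...
        if now > self.expires_at:
            raise RuntimeError("scroll window expired")
        # ...otherwise it pushes the expiry forward by keep_alive again.
        self.expires_at = now + keep_alive


def run_batches(processing_times, keep_alive=60.0):
    """Return how many scroll requests succeed before the window expires."""
    now = 0.0
    window = ScrollWindow.open(now, keep_alive)
    fetched = 0
    for seconds in processing_times:
        now += seconds  # time spent processing the previous batch
        try:
            window.request(now, keep_alive)
        except RuntimeError:
            break
        fetched += 1
    return fetched
```

In this model, five batches that each take 40 seconds all succeed with a 60-second window, even though the whole run takes 200 seconds. A single batch that takes 90 seconds lets the window expire.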

GET /old_index/_search?scroll=1m 
{
    "query": { "match_all": {}},
    "sort" : ["_doc"], 
    "size":  1000
}

In this request, scroll=1m keeps the scroll window open for 1 minute, and sorting by _doc gives the most efficient sort order.
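As a rough sketch, the same initial request could be assembled in Python using only the standard library. The function name build_scroll_search is hypothetical, and the code only builds the path and JSON body; how you send them to the cluster is left to whichever HTTP client you use.

```python
import json
from urllib.parse import quote, urlencode


def build_scroll_search(index, keep_alive="1m", size=1000):
    """Build the path and JSON body for the initial scrolled search.

    Hypothetical helper for illustration; it does not contact a cluster.
    """
    # The scroll keep-alive travels in the query string of the search URL.
    path = f"/{quote(index)}/_search?" + urlencode({"scroll": keep_alive})
    body = {
        "query": {"match_all": {}},
        "sort": ["_doc"],  # skip global sorting; cheapest order for scrolling
        "size": size,
    }
    return path, json.dumps(body)
```

Calling build_scroll_search("old_index") yields the path /old_index/_search?scroll=1m together with a body equivalent to the JSON in the request.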

The response to this request includes a _scroll_id, which is a long Base-64 encoded string. Now we can pass the _scroll_id to the _search/scroll endpoint to retrieve the next batch of results:

GET /_search/scroll
{
    "scroll": "1m", 
    "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}

Note that we again set the scroll expiration to 1m.

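The response to this scroll request includes the next batch of results. How many documents arrive in each batch depends on how the scroll was started. With the older scan search type, the size was applied to each shard, so a batch could contain up to size * number_of_primary_shards documents, many more than the 1,000 we specified. With a regular scroll sorted by _doc, as in the request here, size is the maximum number of hits returned in each batch.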

The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass the _scroll_id returned by the previous scroll request.

When no more hits are returned, we have processed all matching documents.
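The whole cycle (initial search, repeated scroll requests with the latest _scroll_id, stop on an empty batch) can be sketched as a Python generator. This is a minimal sketch under stated assumptions: the send callable is a hypothetical transport that takes a method, a path, and a body and returns the decoded JSON response. The make_fake_send function is an in-memory stand-in for a cluster, included only so that the loop can be run without one.

```python
from typing import Any, Callable, Dict, Iterator, List

# Hypothetical transport: send(method, path, body) -> decoded JSON response.
Send = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def scroll_all(send: Send, index: str, query: Dict[str, Any],
               keep_alive: str = "1m", size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every hit of a scrolled search, one batch at a time."""
    # Initial search: opens the scroll window and returns the first batch.
    response = send("GET", f"/{index}/_search?scroll={keep_alive}",
                    {"query": query, "sort": ["_doc"], "size": size})
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            break  # no more hits: all matching documents have been processed
        yield from hits
        # Always pass the _scroll_id from the most recent response,
        # and set the keep-alive again to refresh the window.
        response = send("GET", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": response["_scroll_id"]})


def make_fake_send(docs: List[Dict[str, Any]], batch_size: int) -> Send:
    """In-memory stand-in for a cluster, for demonstration only."""
    state = {"pos": 0, "last_id": None, "calls": 0}

    def send(method: str, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        if path == "/_search/scroll" and body["scroll_id"] != state["last_id"]:
            raise ValueError("stale scroll_id")
        state["calls"] += 1
        start = state["pos"]
        batch = docs[start:start + batch_size]
        state["pos"] = start + len(batch)
        state["last_id"] = f"fake-scroll-id-{state['calls']}"
        return {
            "_scroll_id": state["last_id"],
            "hits": {"hits": [{"_id": str(i), "_source": doc}
                              for i, doc in enumerate(batch, start)]},
        }

    return send
```

With a real cluster, send would wrap an HTTP client of your choice. The generator form lets the caller process hits as they arrive, so each batch only needs to be handled within the keep-alive period.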

Some of the official Elasticsearch clients, such as the Python and Perl clients, provide scroll helpers that offer easy-to-use wrappers around this functionality.