A scrolled search takes a snapshot in time. It doesn’t see any changes that are made to the index after the initial search request has been made. It does this by keeping the old data files around, so that it can preserve its “view” on what the index looked like at the time it started.
The costly part of deep pagination is the global sorting of results, but if we disable sorting, then we can return all documents quite cheaply. To do this, we use the scan search type. Scan instructs Elasticsearch to do no sorting, but to just return the next batch of results from every shard that still has results to return.
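As a rough sketch of what the initial request might look like, the snippet below assembles its URL and body using only the Python standard library. It assumes an Elasticsearch version that supports the scan search type. The index name old_index, the match_all query, and the build_scan_request helper are illustrative assumptions, not part of any client library.

```python
import json
from urllib.parse import urlencode


def build_scan_request(index, size=1000, scroll="1m"):
    """Return (path, body) for an initial scan-and-scroll search.

    Hypothetical helper: it only builds strings and sends nothing.
    """
    # search_type=scan disables sorting; scroll says how long to keep
    # the scroll open.
    params = urlencode({"search_type": "scan", "scroll": scroll})
    path = f"/{index}/_search?{params}"
    # Assumed query: match every document, `size` results per shard.
    body = json.dumps({"query": {"match_all": {}}, "size": size})
    return path, body


path, body = build_scan_request("old_index")  # index name is an assumption
print(path)
print(body)
```

The path and body could then be sent as a GET request with any HTTP client.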
The response to the initial scan request doesn’t include any hits, but does include a _scroll_id, which is a long Base64-encoded string. Now we can pass the _scroll_id to the _search/scroll endpoint to retrieve the first batch of results, keeping the scroll open for another minute.
Note that we again specify ?scroll=1m. The scroll expiry time is refreshed every time we run a scroll request, so it needs to give us only enough time to process the current batch of results, not all of the documents that match the query.
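A follow-up scroll request could be assembled along these lines. This is a minimal sketch: build_scroll_request is a hypothetical helper, the scroll ID shown is a placeholder rather than a real value, and sending the ID as the raw request body is an assumption that should be checked against the version in use.

```python
from urllib.parse import urlencode


def build_scroll_request(scroll_id, scroll="1m"):
    """Return (path, body) for a follow-up scroll request.

    Hypothetical helper: it only builds strings and sends nothing.
    """
    # Re-sending scroll=1m refreshes the expiry, so the value only has
    # to cover the time needed to process one batch.
    path = "/_search/scroll?" + urlencode({"scroll": scroll})
    # Assumption: the scroll id is sent as the raw request body.
    return path, scroll_id


path, body = build_scroll_request("PLACEHOLDER_SCROLL_ID")
print(path)
```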
The response to this scroll request includes the first batch of results. Although we specified a size of 1,000, we get back many more documents. When scanning, the size is applied to each shard, so you will get back a maximum of size * number_of_primary_shards documents in each batch.
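To make the arithmetic concrete, here is a small worked example. The shard count of 5 is an assumption chosen for illustration, and max_batch_size is a hypothetical helper.

```python
def max_batch_size(size, number_of_primary_shards):
    """Upper bound on documents per scroll batch when scanning."""
    # size is applied per shard, so each shard can contribute up to
    # `size` documents to a batch.
    return size * number_of_primary_shards


# A size of 1,000 on an index with 5 primary shards (assumed count).
print(max_batch_size(1000, 5))  # 5000
```

A batch can be smaller than this bound once some shards have run out of results.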
The scroll request also returns a new _scroll_id. Every time we make the next scroll request, we must pass the _scroll_id returned by the previous scroll request.
When no more hits are returned, we have processed all matching documents.
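The whole loop can be sketched against an in-memory stand-in. FakeScanCluster below is not a real Elasticsearch client. It is a simplified simulation, written for this example, of the behaviour described above: the initial scan returns no hits, each scroll returns up to size documents per shard along with a new scroll ID, and the loop ends when a batch comes back empty.

```python
from itertools import count


class FakeScanCluster:
    """In-memory simulation of scan-and-scroll; NOT a real client."""

    def __init__(self, shards):
        self._shards = [list(docs) for docs in shards]  # docs per primary shard
        self._ids = count(1)
        self._size = None
        self._current = None

    def _new_id(self):
        self._current = f"scroll-{next(self._ids)}"
        return self._current

    def scan(self, size):
        """Initial request: returns a scroll id but no hits."""
        self._size = size
        return {"_scroll_id": self._new_id(), "hits": {"hits": []}}

    def scroll(self, scroll_id):
        """Return up to `size` unsorted docs from each non-empty shard."""
        # Simplification: this fake insists on the most recent scroll id.
        if scroll_id != self._current:
            raise ValueError("stale scroll id")
        hits = []
        for shard in self._shards:
            hits.extend(shard[:self._size])
            del shard[:self._size]
        return {"_scroll_id": self._new_id(), "hits": {"hits": hits}}


def scan_all(cluster, size):
    """Collect batches until a scroll request returns no hits."""
    scroll_id = cluster.scan(size)["_scroll_id"]
    batches = []
    while True:
        response = cluster.scroll(scroll_id)
        hits = response["hits"]["hits"]
        if not hits:
            break  # no more hits: all matching documents processed
        batches.append(hits)
        scroll_id = response["_scroll_id"]  # always use the newest id
    return batches


# Three simulated shards holding 5, 3, and 1 documents.
cluster = FakeScanCluster([range(0, 5), range(100, 103), range(200, 201)])
batches = scan_all(cluster, size=2)
print([len(b) for b in batches])  # [5, 3, 1]
```

With a size of 2 and three shards, no batch can exceed 6 documents, and the batches shrink as individual shards are exhausted.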
Some of the official Elasticsearch clients offer scan-and-scroll helpers that wrap this functionality in an easy-to-use interface.