June 17, 2016 Technical

Where are my documents?
Refreshing news...

By Nik Everett

When you send Elasticsearch a request that modifies or creates documents and it replies with 200 OK or 201 CREATED, it has synced the changes to disk on all active shards¹. That means the changes will survive a catastrophic system shutdown, but it doesn't mean that the changes are available for search. The process that makes changes available for search is called a "refresh", and it is the topic of this post.

Refreshes are performed periodically (index.refresh_interval), when the indexing buffer is full, and on demand (?refresh)². On-demand refreshing is rarely used outside of testing because it creates small index segments, which are inefficient to create and search and must later be merged into larger segments. Waiting for the indexing buffer to fill is unpredictable, so we can't rely on it either. That means that we mostly think of the index as being refreshed every index.refresh_interval, which defaults to 1 second.
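As a rough illustration of the periodic setting, here is a minimal Python sketch that builds the JSON body for an index settings update. The helper name refresh_interval_settings is made up for this post, and the sketch only builds the body. Sending it (for example as a PUT to the index's _settings endpoint) is left to whatever HTTP client you use.

```python
import json


def refresh_interval_settings(interval: str) -> str:
    """Return a JSON body that sets index.refresh_interval.

    `interval` is a time value such as "1s" (the default mentioned above),
    "30s", or "-1" to disable periodic refreshes.
    """
    return json.dumps({"index": {"refresh_interval": interval}})


# Bodies you might send in a settings update for a hypothetical index.
print(refresh_interval_settings("1s"))
print(refresh_interval_settings("30s"))
print(refresh_interval_settings("-1"))
```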

The problem

Refreshing every second is fine if you are indexing something like logs, where you expect to be some amount of time behind real time, but if you are indexing blog posts, comments, or calendars then it can be a problem. For anything where a user might expect to make a change and immediately be able to search for that change (blog, forum, scheduling app), your application needs some way to know that the change is visible for search. This is doubly true for applications that want to use search for something interesting after the user's change (think scheduling or aggregations). In those cases you have a few options, all of which have interesting tradeoffs:

  1. Wait for the refresh, perhaps polling to check that it is there.
  2. Force a refresh with ?refresh
  3. Wait for the refresh to occur with ?refresh=wait_for (coming in 5.0-alpha4!)

Wait for the refresh

You could just wait for the refresh interval to pass. This has the advantage of being something you can do totally asynchronously. The disadvantage is that you have to wait for the whole one second, and even then it is not guaranteed. Refresh isn't instant. Usually it is pretty quick, but some refreshes will be slower than others, so you can't really predict it. For applications that can tolerate not knowing for sure whether something is available for search, this is totally the right choice. But this blog post really isn't about those applications. So, for the sake of this blog post, we're going to assume this option isn't good enough for you.
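If you do go the polling route, the loop usually looks something like the sketch below. It is deliberately generic: is_visible is a placeholder for whatever check your application runs (say, a search for the document you just indexed), and the function name, timeout, and poll interval are assumptions for illustration, not anything Elasticsearch provides.

```python
import time
from typing import Callable


def wait_until_visible(
    is_visible: Callable[[], bool],
    timeout: float = 5.0,
    poll_every: float = 0.1,
) -> bool:
    """Poll `is_visible` until it returns True or `timeout` seconds pass.

    Returns True if the change became visible, False if we gave up.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_visible():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_every)


# Stand-in for a real search: pretends the document shows up on the third check.
checks = {"count": 0}


def fake_search_finds_document() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until_visible(fake_search_finds_document, timeout=1.0, poll_every=0.01))
```

Note that the timeout matters: because refresh times vary, a poller without one can hang for as long as the refresh takes.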

Force a refresh with ?refresh

You could force an immediate refresh. This has the advantage of being pretty quick. Like I said a few paragraphs up, it has the disadvantage of creating small segments that are inefficient to create, search, and merge. For plenty of use cases this inefficiency is worth the speed. Don't be afraid to force a refresh if it makes sense for your use case.

For example, say you are loading something into Elasticsearch and plan to analyze the results. This search index is just for you so you know when you are done loading documents. At that point you shouldn't hesitate to refresh the index. Waiting isn't going to help.

I should mention that adding ?refresh to an index, update, delete, or bulk request is subtly different from performing a refresh API call. Refresh API calls will refresh all the shards in the index. ?refresh will only refresh the shards that have been modified. So for index, update, and delete requests that is just the shard to which the document was routed. For bulk requests that is all shards to which any document was routed.
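To make the difference concrete, here is a small sketch that builds the two kinds of URL. The host, index name, and document path are made-up examples (localhost:9200 and a blog/post/1 style path are assumptions), and the helpers only build strings. They don't talk to a cluster.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumed local node, adjust for your cluster


def document_url(doc_path: str, refresh: Optional[str] = None) -> str:
    """URL for an index/update/delete request, optionally with ?refresh.

    With refresh set, only the shard the document routes to is refreshed.
    """
    url = f"{BASE_URL}/{doc_path}"
    if refresh is not None:
        url += "?" + urlencode({"refresh": refresh})
    return url


def refresh_api_url(index: str) -> str:
    """URL for the refresh API, which refreshes every shard in the index."""
    return f"{BASE_URL}/{index}/_refresh"


print(document_url("blog/post/1", refresh="true"))  # refreshes one shard
print(refresh_api_url("blog"))                      # refreshes all shards
```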

?refresh might also be a bad choice because it affects other indexing in the same index. Say you have a bulk loading process that works quite well. But now you want to start inserting a few documents into the same index interactively. If you do it with ?refresh then, suddenly, you've started refreshing documents outside of whatever refresh interval you were using for the bulk load. If you do that frequently enough that'll change the search and index performance of the bulk loading process.

Wait for the refresh with ?refresh=wait_for (coming in 5.0-alpha4!)

Elasticsearch 5.0 brings a hybrid approach between the two options. Adding ?refresh=wait_for to an index, update, delete, or bulk request will cause the request to wait until its changes have been made visible for search before returning to the user. This has the advantage of being correct without creating inefficient segments. It has the disadvantage of having to wait for the refresh. You don't have to wait for as long as the "wait for the refresh" option because Elasticsearch signals you as soon as the document is ready for search. So if the change comes halfway through the refresh interval, you only have to wait for half of the time.
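From the client's side, using it is just a matter of adding the parameter. The sketch below builds (but does not send) such a request with Python's standard library. The host, document path, and document body are invented for the example, and build_index_request is a hypothetical helper, not part of any Elasticsearch client.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_index_request(base_url: str, doc_path: str, document: dict,
                        refresh: str = "wait_for") -> Request:
    """Build a PUT request that indexes `document` with ?refresh=<refresh>."""
    query = urlencode({"refresh": refresh})
    return Request(
        url=f"{base_url}/{doc_path}?{query}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_index_request(
    "http://localhost:9200",      # assumed local node
    "blog/comment/1",             # hypothetical document path
    {"body": "first!"},
)
print(req.get_method(), req.full_url)
# Sending it with urllib.request.urlopen(req) would block until the change
# is visible for search, which is the whole point of wait_for.
```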

Unlike ?refresh, ?refresh=wait_for won't affect concurrent indexing on the same index. It has no effect on segment size because it doesn't force a refresh immediately. If you must know when the refresh happens, you can wait for the refresh, and you plan to upgrade to 5.0³, then this is the right choice!

Even if you are super excited to upgrade to 5.0-alpha4 to get this feature, keep in mind that Elasticsearch's alphas and betas are for testing purposes only because they aren't compatible with the GA release. We are still finalizing the wire level communications and on disk layout, so 5.0's alphas and betas aren't guaranteed to upgrade properly to 5.0.0, either with rolling restarts or a full cluster restart. Please test this feature to see if it fits for you, but don't upgrade production clusters to alphas or betas.

Back to the feature: there is a limit to the number of ?refresh=wait_for API calls that can be waiting on any one shard, index.max_refresh_listeners, which defaults to 1000. If a request with ?refresh=wait_for comes in while all the slots are full then Elasticsearch will refresh the shard and reply to the request immediately.
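If it helps to picture the slot limit, here is a toy model in Python. It is emphatically not how Elasticsearch implements it. ToyShard and its methods are invented for this sketch, and it assumes that a forced refresh also releases every request that was already waiting, since their changes become visible at the same time.

```python
class ToyShard:
    """Toy model of the per-shard limit on waiting wait_for requests."""

    def __init__(self, max_refresh_listeners: int = 1000):
        self.max_refresh_listeners = max_refresh_listeners
        self.waiting = []        # ids of requests parked until the next refresh
        self.refresh_count = 0

    def refresh(self) -> list:
        """Make changes visible and release every waiting request."""
        released, self.waiting = self.waiting, []
        self.refresh_count += 1
        return released

    def index_wait_for(self, request_id) -> bool:
        """Handle a wait_for request.

        Returns True if the request forced a refresh because all slots were
        full, or False if it was parked to wait for the next refresh.
        """
        if len(self.waiting) >= self.max_refresh_listeners:
            self.refresh()
            return True
        self.waiting.append(request_id)
        return False


shard = ToyShard(max_refresh_listeners=2)
forced = [shard.index_wait_for(i) for i in range(3)]
print(forced, shard.refresh_count, shard.waiting)
```

With only two slots, the first two requests park and the third one forces the refresh.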

What does ?refresh=wait_for do if you set the index.refresh_interval to -1, disabling periodic refreshes, you may ask? Well, the answer is that ?refresh=wait_for will honor whatever refresh interval you configure. With periodic refreshes disabled, the request will only return when you fill the indexing buffer, force an explicit refresh, or try to wait on more than index.max_refresh_listeners requests in the same shard.

index.refresh_interval is just about the maximum amount of time that ?refresh=wait_for will have to wait for the changes to become visible. If you use ?refresh=wait_for, raising the refresh interval will make indexing feel slower and slower to your users, and lowering it will make indexing feel faster and faster. So it might be tempting to lower the refresh interval. Doing so will make less and less efficient segments. Making the refresh interval the same as the write rate is as inefficient as using ?refresh on every request.
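The arithmetic behind that is simple enough to sketch. The function below is an idealized model I made up for illustration: it assumes refreshes fire at exact multiples of the interval and take no time at all, which real refreshes don't, so treat the numbers as a lower bound on the feel of it rather than a prediction.

```python
def wait_for_refresh(arrival: float, interval: float = 1.0) -> float:
    """Seconds a wait_for request waits in an idealized model.

    `arrival` is when the change arrives (seconds on some clock where
    refreshes happen at 0, interval, 2 * interval, ...).
    """
    next_refresh = (arrival // interval + 1) * interval
    return next_refresh - arrival


# A change arriving halfway through a 1s interval waits half of it.
print(wait_for_refresh(0.5, interval=1.0))
# The same change with a 5s interval waits much longer.
print(wait_for_refresh(0.5, interval=5.0))
# The worst case is arriving right after a refresh: the whole interval.
print(wait_for_refresh(0.0, interval=1.0))
```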

Pick the refresh strategy that makes sense for you

Ultimately there is no silver bullet for refreshes. Elasticsearch's index.refresh_interval is useful because it coalesces several changes into one big change to the search index, making a more efficient index. You either wait for the refresh interval, potentially slowing down your users, or you force an immediate refresh and pay the price at search and merge time. ?refresh=wait_for gives you a tool to make waiting for the refresh interval interactive so you can make whatever tradeoffs make sense for you.

Footnotes

1 This is not always true but it is the recommended configuration. See index.translog.durability.

2 Refreshes also occur during recovery, the process that moves shards between nodes. This ought to be rare enough not to factor into most thinking about refreshes.

3 You can cobble together something that works alright in older versions of Elasticsearch using the steps here and/or here. The trouble is that it doesn't work well with bulk and/or replicas. It is far from perfect.