In a recent post, we shared a new system in Elastic Cloud Serverless that automatically adjusts the number of replicas for your indices based on search load. In this post, we’ll zoom out and look at the entire flow for determining the optimal number of replicas. We’ll also look more closely at cache capacity, a crucial consideration when adding replicas, and explain how the replica systems make sure that all replicas fit entirely in the cache given the currently provisioned disk space.
Some Serverless basics
In Elasticsearch Serverless, the object store is the single source of truth. Indexing nodes hold all primary shards, which index data and upload it to the object store; search nodes hold replica shards, which handle all searches by reading data from the object store. So, in Serverless, every index always has at least one replica. The more search nodes there are, the more opportunities there are for additional replicas.
Both tiers scale independently based on the needs of the specific Serverless project. The search tier scales based on search power, a user setting that determines the range of resources available to the cluster, including the minimum amount of disk for caching boosted data: recent time-based (@timestamp) documents within the user-defined boost window, plus any non-time-based documents.
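To make the definition of boosted data concrete, here is a minimal Python sketch of the rule. The function `is_boosted` and its parameters are hypothetical; it only restates the rule described above and is not Elasticsearch code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_boosted(timestamp: Optional[datetime], now: datetime, boost_window: timedelta) -> bool:
    """Hypothetical helper: does a document count as boosted data?

    `timestamp` stands in for the document's @timestamp value, or None for
    a non-time-based document; `boost_window` is the user-defined window.
    """
    if timestamp is None:
        # Non-time-based documents are always boosted.
        return True
    # Time-based documents are boosted only while inside the boost window.
    return now - timestamp <= boost_window


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
window = timedelta(days=7)
print(is_boosted(now - timedelta(days=2), now, window))   # True
print(is_boosted(now - timedelta(days=30), now, window))  # False
print(is_boosted(None, now, window))                      # True
```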
Replicas for instant failover (RfIF): one or two replicas based on search power
Back in 2024, we shipped the first replica system, Replicas for instant failover (RfIF), which decides whether to assign one or two replicas to an index in Serverless. In Serverless, the object store provides durability, but data cached directly on search nodes is what makes searches fast. Two replicas give us resiliency in case we lose a search node unexpectedly. However, more replicas are only effective if the additional copies of the data can fit in the cache.
RfIF looks at the amount of cache space guaranteed by search power. Lower-cost options can only fit one copy of the boosted data on disk. In these cases, all indices get one replica. As search power increases, we take advantage of additional cache space to provision two replicas for some indices. Finally, for the largest search power setting, there’s enough cache space for all indices to get two replicas.
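A minimal sketch of that idea follows. The post does not say how RfIF chooses which indices get a second replica at intermediate search power, so this version simply walks the indices in iteration order and grants a second copy while it still fits in the guaranteed cache. The function and parameter names are hypothetical.

```python
from typing import Dict


def rfif_recommendations(boosted_bytes: Dict[str, int], guaranteed_cache_bytes: int) -> Dict[str, int]:
    """Hypothetical sketch of RfIF: recommend one or two replicas per index.

    boosted_bytes maps each index to the size of its boosted data, and
    guaranteed_cache_bytes is the cache space guaranteed by search power.
    """
    # Every index gets at least one replica.
    recommendations = {index: 1 for index in boosted_bytes}
    # Cache space left after one copy of all boosted data.
    spare = guaranteed_cache_bytes - sum(boosted_bytes.values())
    # Grant a second replica, in iteration order, while the extra copy still fits.
    for index, size in boosted_bytes.items():
        if size <= spare:
            recommendations[index] = 2
            spare -= size
    return recommendations


sizes = {"logs": 100, "orders": 50}
print(rfif_recommendations(sizes, 150))  # {'logs': 1, 'orders': 1}
print(rfif_recommendations(sizes, 200))  # {'logs': 1, 'orders': 2}
print(rfif_recommendations(sizes, 300))  # {'logs': 2, 'orders': 2}
```

The three calls mirror the lower-cost, intermediate, and largest search power cases described above.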
Replicas for load balancing (RfLB): scaling replicas under search load
Replicas for load balancing (RfLB) shipped in early 2026. A previous blog post explains the system in detail (a must-read if you like pizza). In short, the goal of RfLB is to add replicas to indices under high search load. More replicas mean higher throughput because they allow more search nodes to participate in serving results.
Two systems, one recommendation
Both RfIF and RfLB run every five minutes and provide a recommended number of replicas for each index. RfIF recommends either one or two replicas based on search power and the amount of boosted data. RfLB recommends between one and N replicas based on search load, where N is the current number of search nodes. For each index, we take the larger of the two recommendations. These recommendations are then passed into a cache budgeting system, which may limit them based on available cache space.
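The per-index combination step is just a maximum. The sketch below is a hypothetical helper, not the production code, and assumes both systems report on the same set of indices.

```python
from typing import Dict


def combine_recommendations(rfif: Dict[str, int], rflb: Dict[str, int]) -> Dict[str, int]:
    """Take the larger of the RfIF and RfLB recommendations for each index.

    Hypothetical helper; assumes both dicts cover the same indices.
    RfIF yields 1 or 2, RfLB yields 1..N (N = current number of search nodes).
    """
    return {index: max(rfif[index], rflb[index]) for index in rfif}


rfif = {"logs": 2, "orders": 1}
rflb = {"logs": 1, "orders": 5}
print(combine_recommendations(rfif, rflb))  # {'logs': 2, 'orders': 5}
```

The result is what gets handed to cache budgeting, which may lower it.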
Recall from the previous post that if the final recommendation increases replicas from the current state, we apply it immediately. However, if the recommendation is to decrease replicas, we simply capture the signal. Only after ~30 minutes of repeated signals to decrease replicas do we apply the new value.
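The asymmetry between increases and decreases can be sketched as a small per-index state machine. With runs every five minutes, roughly 30 minutes corresponds to about six consecutive decrease signals. The class name, the exact threshold, applying the latest decreased value, and resetting the streak when a recommendation matches the current count are all assumptions for illustration.

```python
from dataclasses import dataclass

RUN_INTERVAL_MINUTES = 5
DECREASE_DELAY_MINUTES = 30  # the post says "~30 minutes"; exact value assumed
REQUIRED_DECREASE_SIGNALS = DECREASE_DELAY_MINUTES // RUN_INTERVAL_MINUTES  # 6


@dataclass
class ReplicaState:
    """Hypothetical per-index state for applying recommendations."""

    current: int
    decrease_signals: int = 0

    def apply(self, recommended: int) -> int:
        """Process one five-minute recommendation; return the replica count."""
        if recommended > self.current:
            # Increases take effect immediately.
            self.current = recommended
            self.decrease_signals = 0
        elif recommended < self.current:
            # Decreases are only recorded until the signal has repeated enough.
            self.decrease_signals += 1
            if self.decrease_signals >= REQUIRED_DECREASE_SIGNALS:
                self.current = recommended  # assumption: apply the latest value
                self.decrease_signals = 0
        else:
            # Assumption: a matching recommendation breaks the decrease streak.
            self.decrease_signals = 0
        return self.current


state = ReplicaState(current=1)
print(state.apply(4))                      # 4: scaled up at once
print([state.apply(2) for _ in range(6)])  # [4, 4, 4, 4, 4, 2]
```

Delaying decreases this way avoids dropping replicas during a brief lull in search load, only to add them back five minutes later.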
Cache budgeting: ensuring replicas fit on disk
Now we’ll look a bit more closely at that last step: the cache budgeting system. If we provision too many replicas, a broad enough set of searches could cause frequent cache evictions, resulting in slow searches. To avoid this, we sometimes reduce the replica systems’ recommendations to ensure all replicas will fit in the cache. RfIF already does this based on search power, but RfLB may recommend larger numbers of replicas for indices with high search load. These are exactly the recommendations we may limit here: those where RfLB exceeds RfIF.
Here’s some pseudocode of the algorithm (and explanation below):
That first line of code has a lot going on:
totalCacheBudget is the amount of the provisioned disk in the cluster allocated for caching data.
cacheUsedByCurrentReplicas is the amount of cache used by the current replica configuration.
cacheFreedByDecreases represents the amount of cache freed up by replica decreases that will occur as a result of the new recommendations. This is valuable cache space that we may want to use for more replicas!
cacheNeededForRfIFIncreases reflects the fact that we will not limit RfIF’s recommendations in this cache budgeting system. Therefore, we must account for all the additional cache RfIF’s recommendations will use.
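Going by those four definitions, the first line presumably combines them as follows. This is a sketch inferred from the descriptions, not the original pseudocode; the snake_case names mirror the post’s variables, and the numbers in the example are made up.

```python
def remaining_bytes(
    total_cache_budget: int,
    cache_used_by_current_replicas: int,
    cache_freed_by_decreases: int,
    cache_needed_for_rfif_increases: int,
) -> int:
    """Cache left for replicas RfLB recommends beyond RfIF (inferred sketch)."""
    return (
        total_cache_budget
        - cache_used_by_current_replicas   # already occupied today
        + cache_freed_by_decreases         # reclaimed by upcoming decreases
        - cache_needed_for_rfif_increases  # RfIF increases are never limited
    )


# Hypothetical numbers, in GB: 1000 - 700 + 100 - 150
print(remaining_bytes(1000, 700, 100, 150))  # 250
```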
Accounting for all of this, remainingBytes is precisely the amount of cache space available for any additional replicas RfLB recommends beyond RfIF. The algorithm that follows considers all indices (allIndicesRankedByLoad is an ordered list of all indices, highest search load first) and spends remainingBytes to grant replicas. This procedure squeezes as many replicas as possible out of the bytes we have. For example, if RfLB recommends four additional replicas but there’s only space for two, we’ll take two!
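A greedy loop like the one described can be sketched as follows. It is a hypothetical reconstruction, not the original pseudocode: it assumes remainingBytes already excludes every replica up to RfIF’s count, so only the copies RfLB wants beyond RfIF are charged against it, and that each extra copy of an index costs a fixed number of bytes.

```python
from typing import Dict, List


def grant_extra_replicas(
    indices_ranked_by_load: List[str],
    rfif: Dict[str, int],
    rflb: Dict[str, int],
    bytes_per_replica: Dict[str, int],
    remaining_bytes: int,
) -> Dict[str, int]:
    """Greedily grant RfLB's extra replicas (beyond RfIF) within remaining_bytes.

    Hypothetical sketch: assumes remaining_bytes already excludes every replica
    up to RfIF's count, so only copies beyond RfIF are charged against it.
    """
    final = {}
    for index in indices_ranked_by_load:  # highest search load first
        base = rfif[index]
        extra_wanted = max(0, rflb[index] - base)
        size = bytes_per_replica[index]
        # How many extra copies still fit (never negative); zero-size copies are free.
        affordable = max(0, remaining_bytes) // size if size > 0 else extra_wanted
        granted = min(extra_wanted, affordable)
        remaining_bytes -= granted * size
        final[index] = base + granted
    return final


ranked = ["hot", "warm", "cold"]
rfif = {"hot": 1, "warm": 2, "cold": 1}
rflb = {"hot": 5, "warm": 3, "cold": 1}
sizes = {"hot": 100, "warm": 50, "cold": 10}
print(grant_extra_replicas(ranked, rfif, rflb, sizes, 250))  # {'hot': 3, 'warm': 3, 'cold': 1}
```

Here RfLB wants four extra replicas for `hot` but only two fit, so it gets two, and the leftover bytes still cover one extra replica for the less loaded `warm` index. Continuing past an index that could not be fully served is what lets the loop use every byte it can.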
Wrapping up
To summarize, replicas in Serverless are determined by two cooperating systems. One is focused on replication for instant failover, while the other replicates to provide better search throughput for indices under substantial search load. These systems run side by side, making a joint decision for each index and checking that the decision actually fits in the cache. The result is a Serverless offering you can rely on to adapt to changing conditions while respecting the bounds of the underlying resources to keep data cached and searches fast.




