Microsoft Azure DevOps Search: Upgrading Elasticsearch and NEST with Zero Downtime

The discussion about upgrading Azure DevOps Search to a new version of Elasticsearch started more than a year ago. We were on Elasticsearch version 2.x and NEST version 1.x at the time. The main reason to move was to ensure that the latest version of Azure DevOps Server shipped with a version of Elasticsearch that would be valid for the foreseeable future. If we’d known how much an upgrade would boost our performance, we probably would’ve done it sooner.

We did a quick prototype with the latest bits of Elasticsearch 5.x to see how big the change would be, how our code would work against Elasticsearch 2.x and Elasticsearch 5.x during migration, and what version compatibility looked like. It was a great exercise because we learned to classify the overall changes into 3 phases:

  1. Move to NEST 2.x
  2. Add support for Elasticsearch 5.x in parallel to Elasticsearch 2.x
  3. Remove support for Elasticsearch 2.x and move to NEST 5.x 

And somewhere in between #2 and #3 above, we needed to move all accounts’ index and search from Elasticsearch 2.x to Elasticsearch 5.x. 

Our Goals

We wanted to achieve a couple of things as part of this reindexing and, importantly, to apply the lessons learned during our last major upgrade (from 1.x to 2.x). We knew we needed to apply ‘timeboxing’ to large reindexings and to improve the monitoring and retry mechanisms we used during the reindex. This would help us achieve the goal of reindexing every account with zero downtime, no exceptions.
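To make the idea of a timeboxed reindex with retries concrete, here is a minimal Python sketch. It is not our production code: the names `run_timeboxed`, `TransientError`, `budget_seconds`, `max_attempts`, and `backoff_seconds` are hypothetical, and the retry policy (give up after a fixed number of consecutive transient failures) is an assumption chosen for illustration.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_timeboxed(
    step: Callable[[], bool],
    budget_seconds: float,
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one unit of reindex work repeatedly inside a time budget.

    `step` returns True when the reindex is complete and False when more
    work remains. The result is "done", "failed", or "timed_out".
    """
    deadline = clock() + budget_seconds
    consecutive_failures = 0
    while clock() < deadline:
        try:
            if step():
                return "done"
            consecutive_failures = 0  # progress was made, so reset the count
        except TransientError:
            consecutive_failures += 1
            if consecutive_failures >= max_attempts:
                return "failed"
            # Linear backoff before the next attempt (an assumed policy).
            sleep(backoff_seconds * consecutive_failures)
    return "timed_out"
```

A monitoring job could treat "timed_out" as a signal to reschedule the account in a later window rather than letting one large reindex hold up the rest. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.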

Additionally, we wanted to refine our topologies to improve index speed for our use case. This included splitting our largest cluster into multiple smaller clusters, and in our case, increasing the number of shards per index. We also set a limit on shard size, so that none would grow beyond 50 GB.
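As a rough illustration of how a shard size limit translates into a shard count, the sketch below computes the smallest number of primary shards that keeps each shard under 50 GB for an expected index size. The function name and the `headroom` growth margin are hypothetical and were not part of our configuration; the sketch also assumes documents spread evenly across shards.

```python
import math

GB = 1024 ** 3
MAX_SHARD_BYTES = 50 * GB  # the shard size limit mentioned above


def primary_shards_needed(expected_index_bytes: int, headroom: float = 0.2) -> int:
    """Return the smallest primary shard count that keeps shards under the cap.

    `headroom` is a hypothetical safety margin reserved for growth: with
    0.2, each shard is planned to hold at most 80% of the 50 GB limit.
    """
    if expected_index_bytes < 0:
        raise ValueError("expected_index_bytes must be non-negative")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_bytes = MAX_SHARD_BYTES * (1 - headroom)
    # An index always has at least one primary shard.
    return max(1, math.ceil(expected_index_bytes / usable_bytes))


# With no headroom, a 100 GB index fits exactly in two 50 GB shards.
two_shards = primary_shards_needed(100 * GB, headroom=0.0)
```

Because the number of primary shards is chosen when an index is created, an estimate like this has to be made up front, which is one reason to plan for growth rather than for the current size alone.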

We also wanted to test configurations and settings that were new to us in 5.x. It was a big jump and we wanted to make sure we took advantage of everything that came with this new version.

Project Documentation

During the project, we regularly captured issues and suggestions in Azure DevOps work items and tagged them as ‘ES5’. This let team members pick up items that helped the reindexing exercises already in progress instead of waiting for the typical “Lessons Learned” document at the end of a project. Documenting issues and suggestions throughout the project also allowed stakeholders to see the current state of the clusters during this internal exercise, and it let us build the reindexing learning document while we delivered the project, rather than having to remember afterwards what had happened. By the time we started working on the learning document, more than 50% of the issues and suggestions had already been resolved.

For example, we realized that enabling automation in places that had previously failed due to temporary causes would save us from having to manually check the account again later. Automation is also useful for deleting old clusters, and this was included in our learnings. A few more fixes in our application code included better telemetry and insights support, improved file size handling, and caching support while deciding the new routing ID for a repository of a collection.

Implementation Stage

To reindex each Search service deployment, we started off by creating a new Elasticsearch 5.x cluster where new accounts could be mapped. Existing accounts continued to be served from the 2.x cluster, and new accounts were created in the 5.x cluster. We have an opt-in model for Search in Azure DevOps. The first option is Code Search, where search is explicitly enabled by installing the Code Search extension. The second option is Wiki/Work-Item Search, which is enabled when users search for the first time. During the reindexing exercise, we enabled the search service in multiple regions, as well as in our wiki, growing the number of accounts sending documents into Elasticsearch 5.x every day.

The process of switching accounts from 2.x to 5.x started by marking all accounts as available for reindexing, which then triggered the reindex and monitoring job. We took ‘n’ accounts that were pending reindexing and started indexing them. For these accounts, indexing happened in the new cluster while search was still served from the older cluster. We then marked the accounts whose indexing had completed, so that new accounts could be picked for indexing. Once all accounts were reindexed, we verified that every account was on Elasticsearch 5.x and that no indexing or searching was still happening on the Elasticsearch 2.x cluster. Once this was confirmed, the Elasticsearch 2.x cluster was deleted.
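The batch-by-batch switch described above can be sketched as a small state machine. This Python example is illustrative only and is not our actual job code: the `Account` fields, the state names, the cluster labels "es2" and "es5", and the choice to return a failed account to the pending pool are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

OLD_CLUSTER = "es2"  # hypothetical label for the Elasticsearch 2.x cluster
NEW_CLUSTER = "es5"  # hypothetical label for the Elasticsearch 5.x cluster


class State(Enum):
    PENDING = "pending"        # marked as available for reindexing
    REINDEXING = "reindexing"  # indexing on the new cluster, search on the old
    COMPLETED = "completed"    # indexing and search both on the new cluster


@dataclass
class Account:
    name: str
    state: State = State.PENDING
    index_cluster: str = OLD_CLUSTER
    search_cluster: str = OLD_CLUSTER


def run_batch(accounts: List[Account], n: int,
              reindex: Callable[[Account], bool]) -> List[Account]:
    """Pick up to n pending accounts, reindex them, and switch the successes."""
    batch = [a for a in accounts if a.state is State.PENDING][:n]
    for account in batch:
        # Indexing moves to the new cluster; search keeps using the old one,
        # so the account stays searchable while it is being reindexed.
        account.state = State.REINDEXING
        account.index_cluster = NEW_CLUSTER
    for account in batch:
        if reindex(account):
            account.state = State.COMPLETED
            account.search_cluster = NEW_CLUSTER
        else:
            # Assumed policy: return the account to the pool for a later batch.
            account.state = State.PENDING
            account.index_cluster = OLD_CLUSTER
    return batch


def old_cluster_is_idle(accounts: List[Account]) -> bool:
    """True when no account indexes into or searches from the old cluster."""
    return all(a.index_cluster == NEW_CLUSTER and a.search_cluster == NEW_CLUSTER
               for a in accounts)
```

Calling `run_batch` repeatedly until `old_cluster_is_idle` returns True mirrors the final verification step before the old cluster is deleted. Search only switches after an account's reindex succeeds, which is what keeps the account searchable throughout.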

The Numbers

We had close to 1.2 billion documents taking around 60 TB of disk space in Elasticsearch 2.x clusters when we started the reindexing exercise. By the time we finished the exercise, the Elasticsearch 5.x clusters held more than 2.5 billion documents taking up around 225 TB of disk space. This was due to the increasing number of accounts opting in to the Search experience and the increasing amount of data per account. We also introduced a new field in our Elasticsearch 5.x mapping that indexes our content field (the largest field) in another format, yet the performance numbers held steady even with the dramatically increased amount of data.

The Azure machine configurations remained the same, but we saw great performance and recovery improvements. The indexing rate improved significantly in Elasticsearch 5.x (1500 documents per second) compared with Elasticsearch 2.x (200 documents per second).

During the time we were reindexing the accounts, we had great support from Elastic support engineers and the community via discussion forums. The Elasticsearch release notes and the documentation were great too; they helped a lot in the initial days when we were reading more about the new version and when we were brainstorming the different settings and configurations we wanted to use.

Staying Engaged with the Elastic Community

It’s important to stay engaged because there are many ways to utilize Elasticsearch that we otherwise wouldn’t have considered. Our team at Microsoft has attended the yearly Elastic{ON} conferences and participates in meetups when possible. It’s at these Elastic events that we realized that, when it comes to performance management, people make similar mistakes and share similar approaches and assumptions. At Elastic{ON}, Elastic Architects are readily available at Q&A booths and willing to walk through our architecture with us. Meetups in Seattle are very useful for meeting others in the Elastic community, including fellow Microsoft employees who work in this space. Hope to see you at one!

Imran Siddique is a Senior Software Engineer at Microsoft currently enabling the search experience for Azure DevOps customers. He joined Microsoft more than 10 years ago and has worked on different Azure services. Imran is a speaker for the Elastic Seattle User Group and a host for Elastic Meetups/Office Hours at Microsoft.