A time comes in the life of many development teams when they consider switching from multiple source code repositories to one (or the other way around). For the team behind the Beats open source projects, this time was a few weeks ago, and it resulted in us merging libbeat, Filebeat, Topbeat, Winlogbeat and Packetbeat into a single git repository last week.
We had all felt for some time that constantly keeping our eight or so repositories in sync was time-consuming, but I personally thought of it as a necessary annoyance that we could live with, rather than something we had to fix in the short term.
So when someone casually brought up the idea during an internal chat, I didn’t think the cost of switching would be worth it. For one thing, with the Beats being open source, the way we organize the repositories is not just our internal business, but it is part of the way we communicate with the wider developer community around the projects. I was worried that a single repo would make the individual Beats feel less independent and might discourage folks from creating their own Beat under their personal GitHub account.
Nevertheless, we started to gather the pros and cons in a document, and we paid more attention to all the tasks that we have to do because of the multiple repos. Small things like bumping the version number in the docs before releasing, creating new GitHub labels, closing and merging the changelog files, publishing the GitHub release, or just checking the open issues across repositories take a significant amount of time when you have to do them four times.
Yes, most of this stuff can be automated, and indeed we had various scripts and tools that helped us a lot. But even then, there was a lot of time wasted, and creating these tools also didn’t come for free. We do everything via pull requests and we review every single one, no matter how trivial. We also don’t merge them until the continuous integration systems give the green light. So, in the last few days before a release all of the tiny changes added up and were taking forever.
Then there were the tasks that are not so easy to automate, like backporting features or bug fixes between release branches. If the feature or bug affected more than one repository, the tedious and error-prone task of resolving rebase conflicts had to be done more than once.
Another important aspect for us was how friendly we were to the occasional external contributor. If someone wanted to fix a bug in Packetbeat but the code was actually in libbeat, they would first have to find the code in a different repository. Then, after fixing it, they would have to figure out how to update the libbeat code in Packetbeat (godep), test it, and open pull requests in both libbeat and Packetbeat. A libbeat change can potentially break the tests in another Beat as well, and we could not expect an occasional contributor to clone three more repositories and run the tests before submitting the PR.
After debating these points, we all agreed that a single repository would help us move faster and with fewer errors.
We thought that switching would be easy, because except for Packetbeat, all the other repositories are only one to six months old. But even for our young projects, there were quite a few things to consider.
The more in-progress work you have at the time of the migration, the bigger the disruption is, so we chose to do the migration in the week after a major release, when most new projects were still in the design phase.
We felt keeping the Git history was important, both to credit our external contributors and for us to be able to track down changes. Luckily the git subtree add command makes this extremely easy. We imported the code from each Beat into a subdirectory in the final repository.
One special requirement we had here was to have both the master and our latest release branch 1.0.0 available, so we actually did the subtree import twice, once for master and once for the 1.0.0 branch.
The resulting graph is messy for sure, but you can clearly see the four repositories merged into one and the 1.0.0 branch being spawned just before the merge.
Issues and links
We also had to keep in mind that GitHub issues cannot be moved cleanly between repositories. There are scripts out there that can close the original issues and create copies with the same content in the new repository, but the new issues will all be created by the same username. Links between issues, as well as links from commit messages to issues, get broken.
For these reasons, we decided to keep the old repos around and only replace their README files with a move notice. We didn’t use a script to migrate the issues. Rather, because there weren't many issues, we moved the open tickets one by one by hand. This gave us an opportunity to clean house.
To minimize the negative effects of the migration, we kept and renamed the repository that had the most history (Packetbeat) and merged the other repositories into it.
Open Pull Requests
You can’t transfer pull requests from one repository to another, so before the cutoff moment, we went through all open pull requests and either worked with the author to get them merged or closed them. Thank you to everyone who helped with this effort.
Dependency handling is fairly specific to Go. We vendor our dependencies in the source tree, which means that before the migration we had a dependencies folder (Godep) in each of the repositories. Because carrying over that structure would negate many of the benefits of the single repository, we decided to switch at the same time to the Go 1.5 vendor experiment and put all our dependencies in a single top-level vendor folder.
We started using glide to manage this folder, and the transition from Godep was fairly simple. We prepared the glide.yml file in advance in an experimental merged repository, so at the cutoff time, we only had to copy the glide.yml file and use glide to create the vendor folder.
Integration tests and build systems
These systems depend on the repository structure, so they needed configuration adjustments. In our environment, we are using Travis CI, Appveyor and our own Jenkins instance to provide test coverage over the different platforms we support. Preparation was key here again. In our experimental merged repository, we prepared new travis.yml and appveyor.yml files that we could simply copy over at the cutoff time.
One negative side effect (and the biggest so far) of the migration is that these systems now have to run the tests from all the Beats on every change, which makes them slower to give us feedback. This is especially a problem with Travis, which immediately after the cutover had to run 11 different targets, some of them taking quite long. Because some of the tests use Docker, all of the tests have to run on a non-containerised system, which adds further to the wait time.
To mitigate this, we moved the most time-consuming Travis targets to be executed by Jenkins only, and we’ll continue to work on making our tests faster.
What does this mean for you?
If you are using one of the Beats, not much changes for you. Our next release will be built from the merged repository, but hopefully this won’t be noticeable to you at all. Please post new issues to the new GitHub repository, and if you have open issues on the old repositories, please help us by moving them to the new one.
If you are maintaining a Beat or other project that depends on libbeat, we left the old libbeat repository alive, so we don’t break any existing code. However, we recommend you update your import paths to use the new libbeat path: github.com/elastic/beats/libbeat. Other than that, nothing really changes. You should continue to maintain your Beat under your GitHub account and we’ll continue to support you in maintaining it to the best of our abilities. We apologize if this has caused or will cause you any issues. It helps us move the Beats platform forward faster and more safely.