Sophisticated Searching For Sophisticated Users
Elasticsearch powers search on GitHub, the largest hosted revision control system in the world with a demanding customer base of over 4 million technical users. GitHub uses Elasticsearch to continually index the data from an ever-growing store of over 8 million code repositories, comprising over 2 billion documents. Using Elasticsearch, GitHub was able to let users easily search this data.
"Search is at the core of GitHub," says Tim Pease, an Operations Engineer at GitHub. "If you go to GitHub.com/search you can search through repositories, users, issues, pull requests, and source code."
One goal of GitHub's Elasticsearch implementation is to index everything that is publicly available on GitHub.com and make it easy to find. Of course, full-text searching is fully supported, but searching based on a wide variety of criteria is also possible and dead simple.
"You can search for a project that uses Clojure as the primary language, and has had activity over the past month, and all this functionality is powered by Elasticsearch," says Pease.
Elasticsearch's flexible storage and retrieval formats, which permit both highly structured and loosely structured data to co-exist in search storage, along with Elasticsearch's extensive set of search primitives, made search implementation straightforward. "You can do lots of queries on that data using Elasticsearch that a standard SQL database won't support," notes Pease.
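As a rough illustration of the kind of combined search Pease describes (a primary language plus recent activity), the request body could be assembled as below. This is a minimal sketch, not GitHub's actual query: the field names `language`, `pushed_at`, and `description` are hypothetical, and the body follows the general shape of an Elasticsearch bool query with `term`, `range`, and `match` clauses.

```python
import json


def build_repo_query(language, active_since="now-1M/d", text=None):
    """Build an Elasticsearch-style bool query body as a plain dict.

    All field names here are assumptions made for illustration only.
    """
    filters = [
        # Exact match on the repository's primary language (hypothetical field).
        {"term": {"language": language}},
        # Only repositories with activity since the given date-math expression.
        {"range": {"pushed_at": {"gte": active_since}}},
    ]
    bool_query = {"filter": filters}
    if text:
        # Optional full-text clause on a hypothetical description field.
        bool_query["must"] = [{"match": {"description": text}}]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    print(json.dumps(build_repo_query("clojure"), indent=2))
```

The structured filters and the optional full-text clause live in the same request, which is the mix of structured and loosely structured criteria the article refers to.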
Powering Analytic Insights Behind The Firewall
GitHub utilizes Elasticsearch's combination of search indexing and analytics capability to drive multiple projects. For example, GitHub found that the analysis capabilities of Elasticsearch queries could be used on stored audit and logging data in order to track users' security-related activity.
"Using Elasticsearch queries, we can quickly see every action the user has done," says Pease. "This is a great way to see whether an account has been stolen, hijacked, or whether the user has done something naughty."
When GitHub set out to track and analyze code exceptions generated by the various software components that power GitHub.com, it originally used a popular NoSQL database. Code exceptions were stored in secondary indexes, and the database's analysis features were used to analyze exceptions over time, with the results stored back into the database.
"It didn't work very well for our use case," remembered Grant Rodgers, a technical staff member at GitHub. "Once we moved everything to Elasticsearch and used its histogram facet queries, everything worked really well."
GitHub uses Elasticsearch's histogram facet query capability, as well as other statistical facets, to track increases in the rate of specific types of code exceptions. That process reveals bugs in their software systems.
"Elasticsearch's histogram facet query capability performs extremely well. We're looking to expand its use in that particular application," says Rodgers.
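To make the idea of a histogram over exceptions concrete, the sketch below reproduces the computation in plain Python rather than calling Elasticsearch: it counts events per fixed-width time bucket and flags buckets whose count jumps relative to the previous one. The function names, the epoch-second timestamps, and the doubling threshold are all assumptions chosen for illustration, not details from GitHub's system.

```python
from collections import Counter


def histogram(timestamps, interval):
    """Count events per fixed-width bucket, keyed by the bucket's start value."""
    counts = Counter((ts // interval) * interval for ts in timestamps)
    return dict(sorted(counts.items()))


def rising_buckets(buckets, factor=2.0):
    """Return bucket keys whose count is at least `factor` times the count
    of the preceding non-empty bucket (empty buckets are not represented)."""
    keys = list(buckets)
    return [
        cur
        for prev, cur in zip(keys, keys[1:])
        if buckets[cur] >= factor * buckets[prev]
    ]


if __name__ == "__main__":
    # Hypothetical exception timestamps in epoch seconds.
    events = [0, 10, 3600, 3610, 3620, 3630, 7200]
    hourly = histogram(events, interval=3600)
    print(hourly)
    print(rising_buckets(hourly))
```

In a real deployment the bucketing would be done by the search engine itself, so only the per-bucket counts travel back to the client.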
Scaling To Millions Of Users
GitHub originally used Solr for search, but found that Solr couldn't scale effectively and was more difficult to manage.
"As more people started using GitHub, we quickly exceeded the storage space that one Solr cluster and Solr instance could handle," says Pease.
Faced with the choice of sharding its own data in Solr in order to handle the load, or moving to Elasticsearch, GitHub found the decision easy.
"We decided to move to Elasticsearch because we figured they could shard things much better than we could," says Pease.
Elasticsearch offers automatic shard rebalancing to increase performance and handle failover conditions. Replica shards are automatically distributed to new nodes in a cluster and, in the case of node failure, shards are automatically migrated from failed nodes to good nodes.
Advanced Sharding for High Performance
With over 2 billion documents, all indexed by Elasticsearch, and with users constantly uploading and modifying code, search performance is a key metric for the GitHub team. GitHub serves, on average, 300 search requests per minute.
GitHub uses Elasticsearch to index new code as soon as users push it to a repository on GitHub. The data becomes searchable very soon afterward, and search results cover public repositories and, for logged-in users, any private repositories they can access.
To optimize access to search data, GitHub uses sharding extensively.
GitHub's main Elasticsearch cluster has about 128 shards, each storing about 120 gigabytes.
To optimize search within a single repository, GitHub uses the Elasticsearch routing parameter based on the repository ID.
"That allows us to put all the source code for a single repository on one shard," says Pease. "If you're on just a single repository page, and you do a search there, that search actually hits just one shard. Those queries are about twice as fast as searches from the main GitHub search page."
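The effect of routing on a repository ID can be sketched as a hash of the routing value taken modulo the number of shards. The code below is a simplified model under stated assumptions: `zlib.crc32` stands in for Elasticsearch's own internal hash function, the helper names are hypothetical, and the shard count of 128 is the approximate figure quoted above.

```python
import zlib

NUM_SHARDS = 128  # approximate shard count quoted in the article


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Map a routing value to a shard number.

    crc32 is only a stand-in here; Elasticsearch uses a different hash.
    """
    return zlib.crc32(str(routing_key).encode("utf-8")) % num_shards


def shards_to_search(repo_id=None, num_shards=NUM_SHARDS):
    """A search scoped to one repository touches one shard;
    a search with no routing value fans out to every shard."""
    if repo_id is None:
        return list(range(num_shards))
    return [shard_for(repo_id, num_shards)]


if __name__ == "__main__":
    print(len(shards_to_search(repo_id=42)))  # single-repository search
    print(len(shards_to_search()))            # site-wide search
```

Because the same repository ID always hashes to the same shard, every document for that repository is stored together, and a repository-scoped search can skip the remaining shards entirely.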