
GitHub: Accelerating software development


  • 2 billion documents
  • 8 million code repositories
  • 4 million active users

The Challenge

How do you satisfy the search needs of GitHub's 4 million users while simultaneously providing tactical operational insights that help you iteratively improve customer service?

The Solution

GitHub uses Elasticsearch to index over 8 million code repositories as well as critical event data.

Case Study Highlights

Enable Powerful Search For Both End-Users And Developers

  • Scale out to meet the needs of a burgeoning user base by migrating from Apache Solr to Elasticsearch
  • Index and query almost any type of publicly exposed data
  • Enable deep programmatic search for developer applications
  • Provide near real-time indexing as soon as users upload new data

Leverage Analytics On Search Data

  • Reveal rogue users by querying indexed logging data
  • Find software bugs within the GitHub platform by indexing all alerts, events, and logs, and by tracking the rate of specific code exceptions
  • Make queries that go beyond standard SQL

Sophisticated Searching For Sophisticated Users

Elasticsearch powers search on GitHub, the largest hosted revision control system in the world with a demanding customer base of over 4 million technical users. GitHub uses Elasticsearch to continually index the data from an ever-growing store of over 8 million code repositories, comprising over 2 billion documents. Using Elasticsearch, GitHub was able to let users easily search this data.

"Search is at the core of GitHub," says Tim Pease, an Operations Engineer at GitHub. "If you go to GitHub.com/search you can search through repositories, users, issues, pull requests, and source code."

One goal of GitHub's Elasticsearch implementation is to index everything that is publicly available on GitHub.com and make it easy to find. Of course, full-text searching is fully supported, but searching based on a wide variety of criteria is also possible and dead simple.

"You can search for a project that uses Clojure as the primary language, and has had activity over the past month, and all this functionality is powered by Elasticsearch," says Pease.
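As a rough illustration of the kind of search Pease describes, the sketch below builds a request body that combines an exact language match with a recent-activity range. It uses the `bool`/`filter`, `term`, and `range` clauses of the current Elasticsearch query DSL, which is not necessarily what GitHub ran at the time. The field names `language` and `pushed_at` and the helper `build_repo_search` are assumptions made for this example, not GitHub's actual schema.

```python
from datetime import datetime, timedelta, timezone


def build_repo_search(language, active_within_days, now=None):
    """Build a query body (a plain dict) for repositories in one
    language that have seen activity within the last N days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=active_within_days)
    return {
        "query": {
            "bool": {
                "filter": [
                    # Hypothetical field holding the primary language.
                    {"term": {"language": language}},
                    # Hypothetical field holding the last activity time.
                    {"range": {"pushed_at": {"gte": cutoff.isoformat()}}},
                ]
            }
        }
    }


# A fixed "now" keeps the example reproducible.
body = build_repo_search(
    "Clojure", 30, now=datetime(2014, 1, 31, tzinfo=timezone.utc)
)
```

The resulting dict could be serialized to JSON and sent to a search endpoint; nothing here talks to a cluster.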

Elasticsearch's flexible storage and retrieval formats, which permit both highly structured and loosely structured data to co-exist in search storage, along with Elasticsearch's extensive set of search primitives, made search implementation straightforward. "You can do lots of queries on that data using Elasticsearch that a standard SQL database won't support," notes Pease.

Powering Analytic Insights Behind The Firewall

GitHub utilizes Elasticsearch's combination of search indexing and analytics capability to drive multiple projects. For example, GitHub found that the analysis capabilities of Elasticsearch queries could be used on stored audit and logging data in order to track users' security-related activity.

"Using Elasticsearch queries, we can quickly see every action the user has done," says Pease. "This is a great way to see whether an account has been stolen, hijacked, or whether the user has done something naughty."
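An audit lookup of this kind might look like the sketch below: match every log entry for one account and return the entries in chronological order. The field names `actor` and `timestamp` and the helper `build_audit_trail_query` are hypothetical; the source does not describe how GitHub's audit log is structured.

```python
def build_audit_trail_query(user, size=100):
    """Build a query body (a plain dict) that lists one user's
    logged actions, oldest first."""
    return {
        # Hypothetical field naming the account that performed the action.
        "query": {"term": {"actor": user}},
        # Hypothetical field holding the time of the action.
        "sort": [{"timestamp": {"order": "asc"}}],
        "size": size,
    }


trail_query = build_audit_trail_query("octocat", size=50)
```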

When GitHub was looking to track and analyze code exceptions generated by the various software components that power GitHub.com, they originally used a popular NoSQL database. Code exceptions were stored in secondary indexes, and the database's analysis features were used to analyze exceptions over time, with the results stored back into the database.

"It didn't work very well for our use case," recalls Grant Rodgers, a technical staff member at GitHub. "Once we moved everything to Elasticsearch and used its histogram facet queries, everything worked really well."

GitHub uses Elasticsearch's histogram facet query capability, as well as other statistical facets, to track increases in the rate of specific types of code exceptions. That process reveals bugs in their software systems.
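To make the idea concrete, here is a minimal local sketch of what a histogram over exception events computes: it groups events into fixed-width time buckets per exception type, then flags buckets where the count jumps. It runs in plain Python on in-memory data and does not call Elasticsearch. The function names, the sample events, and the "at least double the previous bucket" rule are assumptions chosen for illustration, not GitHub's method.

```python
from collections import Counter, defaultdict


def histogram(events, interval):
    """Count events per exception type and per time bucket.

    `events` is an iterable of (timestamp_seconds, exception_type).
    """
    buckets = defaultdict(Counter)
    for timestamp, exc_type in events:
        bucket_start = (timestamp // interval) * interval
        buckets[exc_type][bucket_start] += 1
    return buckets


def spikes(buckets, factor=2.0):
    """Return (exception_type, bucket_start) pairs whose count is at
    least `factor` times the count of the previous non-empty bucket."""
    flagged = []
    for exc_type, counts in buckets.items():
        ordered = sorted(counts.items())
        for (_, prev), (start, cur) in zip(ordered, ordered[1:]):
            if cur >= factor * prev:
                flagged.append((exc_type, start))
    return flagged


# Made-up sample data: timestamps in seconds, 60-second buckets.
events = [
    (5, "TimeoutError"), (42, "TimeoutError"),
    (65, "TimeoutError"), (70, "TimeoutError"),
    (80, "TimeoutError"), (110, "TimeoutError"),
    (10, "KeyError"), (75, "KeyError"),
]
buckets = histogram(events, interval=60)
flagged = spikes(buckets)
print(flagged)  # [('TimeoutError', 60)]
```

In this sample, `TimeoutError` goes from 2 events in the first minute to 4 in the second, so that bucket is flagged, while `KeyError` stays flat.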

"Elasticsearch's histogram facet query capability performs extremely well. We're looking to expand its use in that particular application," says Rodgers.



Scaling To Millions Of Users

GitHub originally used Solr for search, but found that Solr couldn't scale effectively and was more difficult to manage.

"As more people started using GitHub, we quickly exceeded the storage space that one Solr cluster and Solr instance could handle," says Pease.

Faced with the choice of sharding its own data in Solr to handle the load or moving to Elasticsearch, GitHub found the decision easy.

"We decided to move to Elasticsearch because we figured they could shard things much better than we could," says Pease.
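For readers unfamiliar with sharding, the sketch below shows the basic idea of hash-based routing: each document is assigned to one of a fixed number of primary shards by hashing a routing key and taking the remainder. Elasticsearch applies this general scheme internally, but with its own hash function and additional details; `zlib.crc32` here is only a stand-in, and the helpers `shard_for` and `distribute` are hypothetical names for this example.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard number for a document ID.

    zlib.crc32 is a stand-in hash for illustration only; it is not
    the hash function Elasticsearch uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard each one routes to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return shards


doc_ids = [f"repo-{n}" for n in range(1000)]
placement = distribute(doc_ids, num_primary_shards=5)
```

Because the shard is derived from the ID alone, the same document always routes to the same shard, which is what lets a cluster locate it without a central lookup table.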

Elasticsearch offers automatic shard rebalancing to increase performance and handle failover conditions. Replica shards are automat