Awash in a sea of gene and variant information
The availability of genetic and genomic information has exploded in the last decade as sequencing costs have fallen; however, much of this information is scattered across many different resources. For example, different resources on the same gene often have different identifiers, formats, and information. This fragmented data landscape makes creating and maintaining bioinformatics pipelines challenging, frustrating, and time-consuming.
As part of Dr. Andrew Su’s (Associate Professor) computational biology research group at the Scripps Research Institute, our team is interested in solving big data challenges like this fragmented gene/variant data landscape. Dr. Chunlei Wu (Associate Professor) spearheaded the effort to create easy-to-use gene and genetic variant annotation services so that researchers can spend more time making new discoveries and less time dealing with fragmented data. By using our free* services, users can obtain up-to-date gene and variant information in a consistent format (JSON) from either of the two endpoints that each service provides.
Building the solution with Elasticsearch
MyGene.info was the first of the two annotation services we built. In building our services, we knew there were several issues we needed to consider:
- We would be aggregating data on 13 million genes from 7 databases for MyGene.info.
- Both the amount of data from each data source and the number of data sources were expected to continue to grow, so our service had to be able to scale accordingly.
- Users would need to be able to find the information they needed quickly, with flexible ways of finding it, without perceptible drops in performance as the amount of data grows.
Given these constraints, we employed Elasticsearch in our Indexing Engine. Our previous experience with CouchDB for a different resource enabled us to transition smoothly to Elasticsearch, and we were early adopters (circa v0.5.x). Even at that early stage of its development, Elasticsearch was a valuable tool in our arsenal, and we had no doubt it would suit our needs.
Building on our success in making MyGene.info a highly scalable service, we went on to build MyVariant.info to address the even more fragmented data landscape of genetic variant information. MyVariant.info currently has more than 334 million unique gene variants from over 14 databases.
Because our services use Elasticsearch, users can search for one or thousands of gene- or variant-specific JSON objects using flexible query terms and get back just the information of interest to them. If they are only interested in variant annotations from dbSNP or gene annotations from worms, they can specify those filters in their search. Most importantly, users get their results quickly. According to our recent paper in Genome Biology, MyGene.info can handle traffic from more than 5,000 concurrent users at approximately 10,000 requests per minute, and over 95% of actual user requests take less than 30 ms to process. MyGene.info receives requests from over 4,000 unique IP addresses each month, while MyVariant.info serves roughly 1,500 unique IP addresses each month.
Schematic design of the MyGene.info architecture, from Xin J et al., “High-performance web services for querying gene and variant annotation.” Genome Biol. 2016 May 6;17(1):91. doi: 10.1186/s13059-016-0953-9. PubMed PMID: 27154141; PubMed Central PMCID: PMC4858870.
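To make the filtered searches described above more concrete, here is a minimal sketch of how such a query URL might be assembled in Python. The base URLs, the API version segments (`v3`, `v1`), and the parameter names `q`, `fields`, and `species` are assumptions not stated in this post, and `build_query_url` is a hypothetical helper, so check the services' own documentation before relying on them. The sketch only builds the URLs; it does not send any requests.

```python
from urllib.parse import urlencode

# Assumed base URLs and API versions; they are not given in this post.
MYGENE_QUERY = "https://mygene.info/v3/query"
MYVARIANT_QUERY = "https://myvariant.info/v1/query"


def build_query_url(base, q, fields=None, **filters):
    """Hypothetical helper: build a query URL restricted to chosen fields."""
    params = {"q": q}
    params.update(filters)  # extra filters, e.g. species="6239"
    if fields:
        # Ask the service to return only the listed fields.
        params["fields"] = ",".join(fields)
    # Leave ':' and ',' unescaped so the URL stays readable.
    return base + "?" + urlencode(params, safe=":,")


# Gene annotations from worms only (6239 is the NCBI taxonomy ID of C. elegans).
gene_url = build_query_url(
    MYGENE_QUERY, "kinase", fields=["symbol", "name"], species="6239"
)

# Only the dbSNP annotations for a variant looked up by its rsID.
variant_url = build_query_url(
    MYVARIANT_QUERY, "dbsnp.rsid:rs58991260", fields=["dbsnp"]
)

print(gene_url)
print(variant_url)
```

Either URL could then be fetched with any HTTP client, and the response parsed as JSON.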
Tracking our success with Kibana
We already had BioGPS.org, a well-used, user-friendly resource, which originally utilized CouchDB (v1). As we migrated it over to MyGene.info, we wanted a way to distinguish the MyGene.info traffic coming from BioGPS.org from the traffic coming from our various clients (Python, R, etc.). We used Kibana to help visualize the different sources and volumes of traffic for MyGene.info and MyVariant.info. MyGene.info and MyVariant.info each consist of two endpoints, and Kibana gave us an easy way to inspect how those endpoints are used.
Scaling towards other BioThings
MyGene.info currently has 10 shards spread across two web nodes, three master nodes, and three data nodes. Scaling up from 13 million genes to cover 334 million variants, MyVariant.info is made up of 20 shards spread across three web nodes, three master nodes, and five data nodes. We use load balancers to distribute the queries coming into our web nodes to ensure fast and stable processing. Given the lessons on scaling that we learned when developing MyVariant.info after MyGene.info, we expect to be able to readily extend coverage to other research areas that suffer from heavy data fragmentation. Gene annotation and variant annotation data are only two examples of “BioThings” with fragmented data sources, and we hope to expand our service to be of greater use to the research community.
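As a rough illustration of the shard counts quoted above, the sketch below builds the kind of settings body that Elasticsearch accepts when an index is created. It assumes the quoted figures refer to primary shards; the index names and the `index_settings` helper are hypothetical, and replica counts are left out because this post does not state them.

```python
import json

# Hypothetical index names; the shard counts are the ones quoted above.
SHARD_COUNTS = {"mygene_example": 10, "myvariant_example": 20}


def index_settings(number_of_shards):
    """Build a settings body for an Elasticsearch create-index request."""
    return {"settings": {"index": {"number_of_shards": number_of_shards}}}


bodies = {name: index_settings(n) for name, n in SHARD_COUNTS.items()}
print(json.dumps(bodies, indent=2))
```

The number of primary shards is fixed when an index is created, which is one reason to size it with expected data growth in mind.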
For more details about the MyGene.info and MyVariant.info services, please see the research paper published in Genome Biology.
The MyGene.info development team is part of Dr. Andrew Su's computational biology lab at the Scripps Research Institute. The Su lab is interested in making existing information more useful and has released bioinformatics tools such as BioGPS.org, MyGene.info, MyVariant.info, Knowledge.bio, and the citizen science program, Mark2Cure.org. The development of the MyGene.info and MyVariant.info services is led by Dr. Chunlei Wu, who oversees a team consisting of graduate students, post-doctoral research associates, research programmers, and an outreach manager.
*Both MyGene.info and MyVariant.info are free services; however, certain data sources incorporated into these services may have additional restrictions on the use of their data. For example, CADD is only free for non-commercial use. Always check the licensing restrictions for the data sources you are using.