SciBite offers Semantics as a Service. We take text, in any format, and extract the scientific terminology using formal named entity recognition. It’s an incredibly fast solution that integrates seamlessly with existing systems.
Semantics as a Service is available through our Java-based RESTful API, with the following capabilities:
- Highly curated scientific ontologies, built upon open standards
- Formal Named Entity Recognition
- Relationship mapping and extraction, identifying patterns
- Elasticsearch-enabled indexing of semantically rich data
- Live enrichment of browser-based content
- Seamless integration with third-party applications, providing search and connectivity
We provide pluggable technology, so that scientists can integrate semantic enrichment exactly where they need it. Fast, lightweight and simple to use, we transform data by providing technologies that understand the scientific content they process. Our clients span a range of industries, but are predominantly from the life sciences sector, and it’s the pharmaceutical industry that we’ll focus on in this post.
A new drug is initially protected by an exclusive patent, meaning only the patent owner can manufacture and sell it. But all patents eventually expire, after which other companies can make a generic version and capture up to 90% of sales (according to Medcity News).
So what do you do if you’re a pharmaceutical company? Well, ideally, you create a new drug. But creating a new drug is a costly process — an extremely costly process. First, there are the huge resources required to do preliminary research: which disease/syndrome? What drugs are already available? What do they target? Where is there an opportunity? Then comes the actual research, the different trial phases, and all the paperwork to achieve regulatory compliance. We’re talking billions of dollars to get a brand new drug to market (more than $2.5 billion according to Scientific American).
So, how do you continue to do business without such a huge investment?
What if you could find a different use for a drug that was already on the market? A lot of the groundwork will already have been done: you know it’s already been approved as safe for humans, so all you have to do is prove that it’s effective for a different issue or indication. Then you have a new patent and a new period of exclusivity. This route is far cheaper and faster than developing a drug from scratch.
There is a snag to all of this, though.
Finding all those links to other indications still requires significant research investment. Imagine trawling through millions of scientific papers, looking for secondary effects of drugs. It’s painstaking work. And because there are so many synonyms for the same indication, drug or gene/protein, a simple text search (much like the one a well-known search engine performs) will miss many relevant results, as it doesn’t understand that all those different synonyms share one scientific meaning.
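To make the synonym problem concrete, here is a minimal Python sketch comparing a plain substring search with a lookup that knows several names refer to the same indication. The three-document corpus and the synonym table are invented for illustration; a real vocabulary would be far larger and curated.

```python
# Toy corpus and synonym table, invented purely for illustration.
DOCS = [
    "A case report on keratoconjunctivitis sicca in adults.",
    "Patients with dry eye disease were recruited.",
    "Notes on dry eye syndrome.",
]

# Hypothetical mapping from one preferred name to every name we accept for it.
SYNONYMS = {
    "dry eye disease": [
        "dry eye disease",
        "dry eye syndrome",
        "keratoconjunctivitis sicca",
    ],
}


def naive_search(query, docs):
    """Return documents containing the literal query string."""
    q = query.lower()
    return [d for d in docs if q in d.lower()]


def synonym_search(query, docs, synonyms):
    """Return documents containing the query or any of its known synonyms."""
    names = synonyms.get(query.lower(), [query.lower()])
    return [d for d in docs if any(n in d.lower() for n in names)]


naive_hits = naive_search("dry eye disease", DOCS)
semantic_hits = synonym_search("dry eye disease", DOCS, SYNONYMS)
print(len(naive_hits), len(semantic_hits))
```

The literal search finds only the document that uses the exact phrase, while the synonym-aware version finds all three.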
This is where SciBite, using Elasticsearch, steps in.
Looking at repurposing Dipyridamole
Dipyridamole is currently used for treating Angina, a Coronary Artery Disease. At Ariel University in Israel, they are looking at running Phase II trials on using the same drug for Dry Eye Disease, which can lead to loss of sight.
Let’s see how they could have jumped from Angina to Dry Eye Disease if they’d used SciBite and Elasticsearch technology.
Starting from the standpoint of having absolutely no knowledge about this drug, we first need to find out more. Let’s run it through DOCstore, our semantically enabled search engine.
Q1) What are the major diseases (or indications) it treats?
A simple search for “Dipyridamole” comes up with the most frequently mentioned disease terms.
As we see on the right of the image, the most common of these are Coronary Artery Disease, Ischemia, Myocardial Infarction, Thrombosis, and Stroke. Quite different from Dry Eye Disease, aren’t they? Which brings us to:
Q2) How on earth is Coronary Artery Disease linked to Dry Eye Disease?
Drugs work by binding to a molecular target, typically a protein. Each of these proteins belongs to a protein type, or protein family if you like. By binding to its target, a drug can alter how a cell behaves. Now, this relationship between a drug and a protein isn’t necessarily exclusive to a single disease. It can be relevant to other diseases too.
Taking a closer look at the Gene / Protein and Protein type boxes, we get a strong idea of the top co-occurring proteins and protein types.
As we see from the image above, adenosine pops up the most. It would be useful to find out a bit more about this interaction and where else it could happen. So, on to…
Q3) Where else do these drug-protein interactions happen?
We can click to filter on a particular protein type of interest – in this case, adenosine.
The image below shows a matrix view of the less well known anatomical features where this is relevant. Why are we looking at the less researched areas? Well, if we look at the most researched areas, chances are another group will have already tried their luck with repurposing for some of those diseases. By going for the other end of the spectrum, we’re less likely to encounter competition.
When we click on ‘eye’, DOCstore returns diseases including ocular hypertension, glaucoma, inflammation, and keratoconjunctivitis sicca, also known as Dry Eye Disease.
Now we have eight documents that nicely describe the background to this repurposing example, two of which are below and have been highlighted by SciBite’s TERMite and DOCstore to pick up on Dry Eye Disease or Syndrome.
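For readers who think in queries, the three questions above map quite naturally onto filtered aggregations. The sketch below builds an Elasticsearch request body in Python. It is not DOCstore’s actual query: the function name `build_repurposing_query` and the field names (`drug`, `protein_type`, `anatomy`, `indication`) are assumptions made for illustration, and each field is assumed to be a keyword field holding the normalised entity names.

```python
import json


def build_repurposing_query(drug, protein_type=None, anatomy=None, size=10):
    """Build a request body: filter on entities, then count co-occurring ones.

    All field names here are hypothetical, not DOCstore's real mapping.
    """
    filters = [{"term": {"drug": drug}}]
    if protein_type:
        filters.append({"term": {"protein_type": protein_type}})
    if anatomy:
        filters.append({"term": {"anatomy": anatomy}})

    return {
        "size": 0,  # we only want the aggregation buckets, not the hits
        "query": {"bool": {"filter": filters}},
        "aggs": {
            # Q1: which indications are mentioned most often with the drug?
            "top_indications": {
                "terms": {"field": "indication", "size": size}
            },
            # Q2: which protein types co-occur most often?
            "top_protein_types": {
                "terms": {"field": "protein_type", "size": size}
            },
            # Q3: less-studied anatomy. Ascending count order is allowed but
            # can be approximate on a sharded index, so treat it as a hint.
            "rare_anatomy": {
                "terms": {
                    "field": "anatomy",
                    "size": size,
                    "order": {"_count": "asc"},
                }
            },
        },
    }


q1 = build_repurposing_query("dipyridamole")
q3 = build_repurposing_query("dipyridamole", "adenosine", "eye")
print(json.dumps(q3, indent=2))
```

Each click in the walkthrough simply adds another `term` filter, and the same aggregations are recomputed over the narrower set of documents.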
How did we do it?
All of this was possible because of our semantic search platform, which works differently from traditional search engines. Try asking your favourite search engine for other diseases associated with the drug, and you’ll find it’s hard to get a quick and precise answer. Most search engines deal only with text, which has no meaning to a computer. In other words, it’s not semantically represented. Our approach combines two key elements:
- The semantic annotation, or SciBite’s TERMite, our Named Entity Recognition engine
- Our search platform, DOCstore, which we’ve built on Elasticsearch
TERMite recognises concepts in text, such as drug names or diseases. Remember we mentioned that these can have multiple names? Viagra, for example, is also known as Sildenafil Citrate, or sometimes just Sildenafil: multiple synonyms for the same entity. TERMite resolves these synonyms upfront rather than in a synonym token filter, because disambiguation is important and TERMite excels at correctly identifying entities. Take the word ‘GSK’ appearing in some text. Does it refer to the company (GlaxoSmithKline) or the protein (Glycogen Synthase Kinase)? TERMite contains the domain expertise to identify entities relevant to our life sciences space.
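To give a feel for why disambiguation needs more than a synonym list, here is a deliberately simple Python sketch that picks a sense for an ambiguous token by counting context cue words. This is not how TERMite works internally; the `SENSES` table, the cue words and the scoring rule are all assumptions made up for this example.

```python
import re

# Hypothetical candidate senses for an ambiguous token, each with cue words
# that hint at that sense when they appear nearby.
SENSES = {
    "GSK": [
        {
            "type": "COMPANY",
            "name": "GlaxoSmithKline",
            "cues": {"company", "pharma", "shares", "announced"},
        },
        {
            "type": "PROTEIN",
            "name": "Glycogen Synthase Kinase",
            "cues": {"kinase", "phosphorylation", "inhibitor", "signalling"},
        },
    ]
}


def disambiguate(token, text):
    """Return the sense whose cue words overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scored = [(len(sense["cues"] & words), sense) for sense in SENSES.get(token, [])]
    best_score, best = max(scored, key=lambda s: s[0], default=(0, None))
    # With no supporting context we refuse to guess.
    return best if best_score > 0 else None


company = disambiguate("GSK", "GSK announced a new pharma partnership")
protein = disambiguate("GSK", "GSK phosphorylation of tau by the kinase")
unknown = disambiguate("GSK", "GSK was mentioned")
```

A plain synonym filter would treat all three sentences identically, which is exactly the failure mode that upfront annotation is meant to avoid.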
Here’s an example of TERMite in action on an article on Pubmed. The images show an abstract after concepts have been annotated by TERMite.
TERMite semantically enriches text. The enriched text then gets indexed into Elasticsearch. Once there, we can use the power of Elasticsearch to run semantic search queries and analytics on large corpora of biomedical literature.
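As a rough illustration of the enrich-then-index step, the Python sketch below turns annotated documents into the newline-delimited JSON payload that the Elasticsearch bulk API accepts (one action line followed by one source line per document). The document shape, the index name `abstracts` and the entity field names are assumptions for this example, not DOCstore’s real schema.

```python
import json

# Hypothetical annotated documents: the raw text plus normalised entity names
# produced by an upstream annotation step.
ANNOTATED = [
    {
        "id": "doc-1",
        "source": {
            "text": "Viagra was prescribed.",
            "drug": ["sildenafil"],  # synonym already resolved upstream
            "indication": [],
        },
    },
    {
        "id": "doc-2",
        "source": {
            "text": "Sildenafil citrate dosage notes.",
            "drug": ["sildenafil"],
            "indication": [],
        },
    },
]


def to_bulk_payload(index_name, docs):
    """Serialise documents as NDJSON for the Elasticsearch bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc["source"]))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


payload = to_bulk_payload("abstracts", ANNOTATED)
print(payload)
```

Because both documents carry the same normalised `drug` value, a single term query finds them regardless of which name the author used.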
So why did we do that? The horizontal scalability, myriad of analytical aggregations, clean API, excellent documentation and vibrant developer community were key reasons why we opted for Elasticsearch. In particular, the Elasticsearch aggregations framework provides an elegant solution to really facilitate answering some of our customers’ key questions.
The ~27 million PubMed abstracts available from the National Institutes of Health in the USA form one of the datasets we index in DOCstore. We rely heavily on aggregations, mainly bucketed terms and significant terms aggregations, for our analytics. We also employ date histograms for trend analysis, and in the near future we’ll be using more of the geo aggregations to help analyse data from our geolocation-based vocabularies. Our platform allows researchers to search for entity types or instances in documents regardless of which of the many names they might be referred to by. Simply put, this is a semantic search system, which relies heavily on Elasticsearch’s ability to store, aggregate and serve large amounts of text-centric data.
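The aggregations mentioned above could be combined in one request along the following lines. This Python sketch only builds the request body; the function name `build_trend_query` and the field names `drug`, `indication` and `published` are assumptions for illustration. Note that `calendar_interval` is the parameter name in recent Elasticsearch releases, while older releases used `interval`.

```python
import json


def build_trend_query(drug, size=10):
    """Request body with a significant terms and a date histogram aggregation.

    Field names are hypothetical placeholders, not a real DOCstore mapping.
    """
    return {
        "size": 0,  # aggregation results only
        "query": {"term": {"drug": drug}},
        "aggs": {
            # Indications that are unusually frequent in documents about this
            # drug compared with the index as a whole.
            "unusual_indications": {
                "significant_terms": {"field": "indication", "size": size}
            },
            # Number of matching documents per publication year.
            "mentions_per_year": {
                "date_histogram": {
                    "field": "published",
                    "calendar_interval": "year",
                }
            },
        },
    }


trend_query = build_trend_query("dipyridamole")
print(json.dumps(trend_query, indent=2))
```

The difference between the two bucket types matters for repurposing: a terms aggregation surfaces what is common, whereas significant terms surfaces what is distinctive for the filtered set.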
Drug companies exist to make more drugs, and anything that helps them do that more quickly and with fewer resources is going to be of interest. That has certainly helped us to build our client base.
Search is becoming more semantically aware in the life sciences and Elasticsearch is underpinning a lot of this drive forward.
Phil Verdemato, Head of Software Engineering at SciBite
A geek at heart, Phil has spent the majority of his career writing code to solve life science problems. At SciBite since 2015, he’s now part of a global movement to transform data and how we interact with it. In his spare time, he enjoys building go-karts and tinkering with his Raspberry Pi.