More Signal, Less Noise with Elasticsearch
“Putting Elasticsearch at the centre of what we are building has proven itself to be one of the best technology decisions that we have made during our growth.”
- Wesley Hall, CTO, Signal
Wesley Hall is the co-founder and CTO at Signal, a London-based startup focused on providing information-dependent individuals and businesses with personalised, high-relevance news feeds that enable more informed planning and decision making. Wesley has over 15 years' experience in architecting and building complex software systems for organisations such as eBay, Fab.com and Aimia.
By the time you finish reading this sentence, the Signal platform will have received, extracted, analysed, classified and indexed a volume of text equivalent in size to the complete works of Shakespeare. This is a process that continues, in real time, 7 days a week, 365 days a year, allowing users to strategise intelligently and make data-driven decisions.
The Importance of Search
As a startup, and lacking the resources of the well-known search behemoths, we needed to create the kind of infrastructure that could manage this data and make it available instantly, and we needed to do this on a budget. Since a primary function of our product is search and discovery, what we really needed was a highly-effective, open source text search infrastructure component that could be relied upon to scale up with our increasing data volumes. We found this component in the fantastic Elasticsearch system.
The Signal software system is effectively a large-scale data processing pipeline that runs sophisticated text analytics routines. As we receive our incoming stream of news articles and blog posts, each is processed through this pipeline, extracting references to important entities such as organisations, people and places. Articles are classified into topics using advanced models built with our machine learning technology. Finally, articles are grouped into clusters where they describe related events or stories. The result of this processing pipeline is large volumes of text (up to 7 million individual articles per day and growing) with associated metadata describing the meaningful entities and concepts that have been detected by our system.
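As a rough illustration of what such a pipeline might emit, an enriched article document could look something like the following. The field names, entity types and values here are our own sketch, not Signal's actual schema, and the hard-coded metadata stands in for the output of real NER, classification and clustering models.

```python
# Hypothetical sketch of an enriched article document as it might leave
# a pipeline like the one described. All field names are illustrative.
from datetime import datetime, timezone

def enrich_article(raw_text, title):
    """Assemble an article document carrying the entity, topic and
    cluster metadata a text-analytics pipeline might attach."""
    return {
        "title": title,
        "body": raw_text,
        "published_at": datetime.now(timezone.utc).isoformat(),
        # In a real pipeline these values would come from entity
        # extraction, topic classification and story clustering;
        # they are hard-coded here purely for illustration.
        "entities": [
            {"type": "organisation", "name": "Acme Corp"},
            {"type": "place", "name": "London"},
        ],
        "topics": ["mergers-and-acquisitions"],
        "story_cluster_id": "cluster-42",
    }

doc = enrich_article("Acme Corp announced...", "Acme expands in London")
```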
Once this data has been prepared by our pipeline system, it is indexed into an Elasticsearch cluster consisting of 15 fairly substantial, cloud-based server instances, making it available for search, exploration and discovery by our users.
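Indexing at this volume is typically done through Elasticsearch's `_bulk` API, which takes an NDJSON body of alternating action and source lines. The following minimal sketch builds such a body with the standard library; the index name `articles` and the document shape are our assumptions, not details from the article.

```python
# Sketch of building an Elasticsearch _bulk request body (NDJSON).
# The index name "articles" is an assumption for illustration.
import json

def bulk_body(docs, index="articles"):
    """Build a _bulk API body: one action line ({"index": ...})
    followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The _bulk endpoint requires the body to end with a newline.
    return "\n".join(lines) + "\n"

body = bulk_body([{"title": "Acme expands in London",
                   "topics": ["mergers-and-acquisitions"]}])
```

In practice a client library (for example, the official Python client's bulk helpers) would batch and send these requests, but the wire format is the same.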
The decision to place Elasticsearch at the centre of our infrastructure was really a matter of coming to the realisation that search is not just a feature of our product; it is our product. Our entire offering is based around providing users with the ability to locate relevant information. Our system divides neatly between the extraction of this relevant information and the ability to make effective use of it.
Previously, search had been much more of a secondary feature, as we focused on the delivery, inbox-style, of relevant articles to interested users. The shortcomings of this approach quickly became obvious, however, as we began to consider where our users might want to go next. The feeds themselves offer an excellent jumping-off point into further exploration and discovery of the available content, but this service was difficult to provide without an effective search system.
Up until this point we had been using the AWS CloudSearch product to provide the search capabilities of our system, but CloudSearch lacked the advanced features that we needed, such as complex aggregations and sophisticated monitoring and control. We needed something with these features and more…
Elasticsearch was the obvious front-running candidate to provide the infrastructure required for this project. Having worked (albeit briefly) with Elastic CTO Shay Banon as part of an eBay project, I was already aware of the degree of technical competency found in the Elasticsearch software. I knew that Elastic really would mean elastic, and having confidence in the original creators takes me a long way in my technology selection process. We began a proof-of-concept implementation towards the end of 2014, and like many successful PoCs this eventually grew into our V2 product.
Reimagining the View on Data
Reimagining our product as a multi-faceted search system has changed its nature and how we approach its evolution. We move forward by providing our users with different views and perspectives on their data, all of which are built from combinations of features native to Elasticsearch. The ability to create visualisations of complex aggregations introduces entirely new ways to look at the world.
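A view like "article volume per day, with the most-mentioned entities in each period" maps directly onto Elasticsearch's aggregation DSL. The sketch below builds such a request body; the field names (`published_at`, `entities.name`, `topics`) are our illustration rather than Signal's actual mapping, and older Elasticsearch versions named `calendar_interval` simply `interval`.

```python
# Sketch of an Elasticsearch aggregation request of the kind that can
# power time-based entity views. Field names are illustrative only.
query = {
    "size": 0,  # we only want aggregation buckets, not the hits
    "query": {"match": {"topics": "mergers-and-acquisitions"}},
    "aggs": {
        "articles_per_day": {
            "date_histogram": {
                "field": "published_at",
                "calendar_interval": "day",
            },
            # Nested terms aggregation: top entities within each day.
            "aggs": {
                "top_entities": {
                    "terms": {"field": "entities.name", "size": 5}
                }
            },
        }
    },
}
```

Sent to the `_search` endpoint, a request shaped like this returns daily buckets, each carrying its own top-entities sub-buckets, which is exactly the kind of structure a visualisation layer can render directly.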
Better still, with all of our data available in Elasticsearch, our development and research teams can quickly explore ideas and concepts using tools like Kibana to test the value of a particular approach to data delivery and visualisation. This ability to ‘mock up’ a technique before taking the time to integrate it into our core product has enabled us to take a far more experimental approach in our development process.
Engaging Elastic to provide development support for our infrastructure is proving to be one of those decisions that we really wish we had made earlier. We have found real value in having access to the core team at Elastic to ensure that we are using the system effectively.
Putting Elasticsearch at the centre of what we are building has proven itself to be one of the best technology decisions that we have made during our growth. It has truly been a joy to work with, and we look forward to finding new and interesting ways to leverage its features to provide the kind of value that our customers have come to expect of us.