How Elasticsearch Helped Orange to Build Out Their Website Search
Jean-Pierre Paris has been the Tech Lead at Orange Group since 1996 (except for four years between 2000 and 2004, when he worked at a startup in the IP traffic analysis field). In 2009, Jean-Pierre joined the Orange French web search engine team. Since late 2013, among other operational and development tasks, he has been working with Elastic products.
A web search engine is not a monolithic application. The picture below shows the many different types of data that contribute to building a useful and attractive results page: a news section (a), a "did you know" section (b), French web documents (blue) with a sub-section of associated articles (red) (c), videos (d), and the buzz (e). In addition to the web search index, we have developed specialized "vertical engines" for each type of data, such as weather, recent films, or sports-related news. Each of these vertical engines is built on a dedicated corpus that is much smaller than the main French web document collection.
[Screenshot: an example results page showing panels (a) through (e)]
A History of Our Legacy
Since its beginning more than 15 years ago, the Orange French search engine, le moteur de recherche d'Orange, has evolved by developing its own technology and by integrating open source frameworks. Back in 2013, each vertical engine was built on proprietary solutions or older versions of the Sphinx search engine. Moreover, every single vertical engine ran on its own set of hardware, including redundancy. This was expensive and difficult to manage, but even worse, the complexity increased every time a new vertical engine or new feature had to be deployed. Last but not least, the job of the team that runs these engines became more and more complex because of the various technologies in use. After living with this complexity for years, we realized that we needed to reduce the number of different technologies and improve our ability to quickly add new features.
Finding a New Universal Engine to Power Our Search Platform
Selecting a new technology on which to base all of our vertical engines was a difficult but important choice for us. Different teams had developed the vertical engines, and each had been working with a different search technology. To encourage them to switch, we had to simplify the interface and ensure that the technology we selected could meet many technical requirements, like high availability, high throughput, and low latency.
During the summer of 2013, we evaluated both Elasticsearch and Solr. The choice quite quickly came down to Elasticsearch, mainly because of its consistent, comprehensive API and the fact that it was designed from the beginning to be, well, elastic.
Elasticity (horizontal scalability) was one of the key requirements for our migration. Even though our initial roll-out started with a single vertical engine, we were selecting the technology on which all of our future vertical engines and features would be built. As such, it had to scale to handle new vertical engines, but it also had to grow along with our user base.
The first Elasticsearch cluster went live at the end of 2013, with just 2 indexes and fewer than one million documents. Today, we have 3 clusters, the biggest holding 50 million documents across 20 virtual machines (8 CPUs, 20GB RAM, and 100GB HDD each). The primary size of these indexes is 150GB, and we're able to process hundreds of requests per second with latencies under 200ms, all while running on VMs rather than dedicated hardware.
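To make that elasticity concrete, here is a minimal sketch (in Python, against the REST API) of how shard and replica settings let an index spread across nodes and grow with traffic. The endpoint, index name, and numbers are placeholders for illustration, not our production configuration.

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint, not our actual cluster

# Create an index whose primary shards can be distributed across the cluster's
# nodes; replicas add redundancy and extra read throughput. Names and numbers
# here are purely illustrative.
settings = {
    "settings": {
        "number_of_shards": 5,    # fixed at index creation, spread over data nodes
        "number_of_replicas": 1,  # can be raised later as query load grows
    }
}
requests.put(f"{ES}/web-vertical", json=settings).raise_for_status()

# The replica count can be changed on a live index, which is one way a cluster
# can grow with its user base without reindexing.
requests.put(
    f"{ES}/web-vertical/_settings",
    json={"index": {"number_of_replicas": 2}},
).raise_for_status()
```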
Moving away from our legacy interface was made easy by the Elasticsearch REST JSON API, since JSON parsing and HTTP clients are straightforward to develop in almost every language. Moreover, Elastic provides client libraries for mainstream languages, which simplifies our interaction with Elasticsearch even further by hiding the low-level JSON parsing and HTTP handling.
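To give a sense of how little glue code this takes, here is a minimal sketch using the official Python client. The index name, fields, and endpoint are made up for illustration, and the exact parameter names vary slightly between client versions.

```python
# pip install elasticsearch  (parameter names differ slightly across client versions)
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Index a document: the client takes care of JSON serialization and the HTTP call.
es.index(index="films", id="1", document={
    "title": "Le Dernier Volcan",     # made-up example document
    "release_date": "2015-06-01",
})

# Full-text search: again, no hand-written HTTP requests or JSON parsing needed.
results = es.search(index="films", query={"match": {"title": "volcan"}})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["title"])
```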
Thanks to these simple client libraries, our internal customers are now moving to Elasticsearch for the vertical engines. Because we can rapidly deploy new Elasticsearch indexes and clusters, it is easier than ever for our internal teams to create new vertical engines, add new features, and handle more data.
Based on our experience so far, we are confident that we can operate Elasticsearch in a demanding environment. We rely heavily on Marvel, the Elasticsearch monitoring tool, to help us keep an eye on the running clusters, with a zero-downtime objective for one critical cluster.
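Alongside Marvel's dashboards, even a basic scripted check against the cluster health API goes a long way. A rough sketch, with a placeholder endpoint and purely illustrative alerting logic:

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# _cluster/health reports the overall status (green/yellow/red) plus shard
# counts, which is easy to wire into alerting around a zero-downtime objective.
health = requests.get(f"{ES}/_cluster/health").json()
unassigned = health.get("unassigned_shards", 0)
if health["status"] != "green" or unassigned > 0:
    print(f"cluster needs attention: status={health['status']}, "
          f"unassigned_shards={unassigned}")
```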
As of now, when you query the search engine on orange.fr, most of the results you get from the vertical engines, whether it's the latest sports update, a new movie, or the recent volcanic eruptions in Italy, are powered by Elasticsearch.
What's Next?
We are currently experimenting with Elasticsearch for more of our internal tools. For example, we are developing a tool to analyse the readability, or "quality", of the 1.2 billion URLs in our French web database. We use Kibana to create dashboards that aggregate URL information for a specific host or domain (see image below).
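Under the hood, a dashboard panel like that boils down to an ordinary aggregation query. As a sketch, assuming hypothetical index and field names such as urls, host, and quality_score (not our actual mapping):

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# Bucket one domain's URL documents by host and average a hypothetical
# quality_score field -- the kind of query a Kibana panel issues behind the scenes.
query = {
    "size": 0,
    "query": {"term": {"domain": "example.fr"}},
    "aggs": {
        "per_host": {
            "terms": {"field": "host", "size": 20},
            "aggs": {"avg_quality": {"avg": {"field": "quality_score"}}},
        }
    },
}
resp = requests.post(f"{ES}/urls/_search", json=query).json()
for bucket in resp["aggregations"]["per_host"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["avg_quality"]["value"])
```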
We can then quickly determine whether the URLs are readable and get a sense of the overall quality of the domain or host. We also use Logstash and Kibana to experiment with log aggregation for our live cluster. And I am really excited to discover Packetbeat, as it takes me back more than 10 years, to when I was working on IP traffic analysis.
Our success with Elasticsearch makes us want to experiment with the entire Elastic platform for internal tools handling much more data, so there are definitely more stories to be told in the near future when these experiments go live. Stay tuned, we'll be back with more Elasticsearch fun… (as you can see, I am already working on it :))