Maintaining Excellence in Commercial Performance
Sub-departments have access to 400 dashboards, which enable them to simultaneously track 50 projects in real time and to respond immediately to any events that arise.
Automatic Detection of Intrusions
By indexing server activity, Elastic enables the detection of spy robots and cyber-attacks and automatically triggers counter-measures.
IT Incidents Resolved in a Matter of Minutes
It used to take hours to search through dozens of servers whenever incidents occurred. Now we simply use dashboards that provide a summary of server logs.
A Promotion for the Technical Team
All of the monitoring that the sub-departments benefit from today is a direct result of the work of the technical operations team. So much so that the team has been promoted to a strategic Big Data unit.
As the number one French online tourism site, and the leading e-commerce site in France, Oui.SNCF is the expert distribution channel for French railways. The SNCF subsidiary reached a turnover of 4.1 million in 2016 thanks to the annual sale of 86 million tickets, with up to 40 tickets sold per second during peak times. The site receives on average 13 million unique visitors per month, 63% of whom access the company’s services through mobile devices. Its V. application has been downloaded 15 million times, and a third of its transactions are completed via the app. In IT terms, Oui.SNCF's business is supported by 4,000 servers, split between two data centers, under the aegis of the Oui.SNCF branch that is responsible for technical management. These servers are teeming with potential indicators for improving sales and business services.
Visualizing Data to Enhance Sub-Department Efficiency
Oui.SNCF currently uses 400 dashboards, some of which are permanently displayed on wall screens to monitor business activity in real time. This has been made possible both by the Elastic Stack, which indexes data collected from the company's site and mobile app, and by Kibana's dashboard creation features. As a result, sub-departments have been able to maximize the performance of their services.
Oui.SNCF's Experience with Elastic
Dominique Debruyne is in charge of the Big Data technical arm at Oui.SNCF-Technologies. His current objective is to build a technical platform for sending, storing, archiving, processing, and restoring a maximum number of internal and external data sources in order to gain a better understanding of the company's customers, conduct predictive analyses, and monitor the performance of information systems in real time. However, these are relatively new tasks. With Oui.SNCF-Technologies in charge of the development, hosting, and deployment of IT tools to respond to sub-department needs, Dominique Debruyne's initial objective was in fact to guarantee the QoS and SLAs of structured data stored in relational databases.
To simplify performance monitoring, which was becoming ever more complex due to the increasing number of information systems and applications, we centralized our servers' logs in a data lake, from which we were then able to derive specific indicators. This system quickly proved hugely valuable for the technical team, and it was quite obvious that it would make sense to further extend it to meet the needs of the sub-departments as well and to get even more value out of it. And that's how the Big Data team was born, two and a half years ago.
With the aim of always delivering the very best quality of service, Dominique Debruyne wanted his colleagues and sub-departments to be able to use the Big Data tools with absolute autonomy. For this reason, he opted for user-friendly and open-source software which could be easily adapted to the needs of the users.
The Challenge: Maintaining a High Quality of Service Despite Increased IT Complexity
At Oui.sncf, the growth in the number of servers from 2013 onward had a negative impact on the efficiency of both the technical teams and the sub-departments. The technical teams were losing time downloading logs to their Windows desktops in order to check that the hardware was functioning properly. Meanwhile, the sub-departments were struggling with queries that slowed down the system whenever they tried to analyze their commercial data within what was by now a sprawling Oracle database.
In very little time at all, we'd gone from several dozen servers to several thousand! In the early days, the moment a customer raised an anomaly with us, we needed to go and search for their processes within a very large quantity of logs in order to identify exactly where the problem was. This took us time, and posed a risk to the quality of our service level.
Any delay in problem resolution could have an impact on the customer experience, thus representing a risk in terms of sales. Similarly, the slow speed often suffered by sub-departments was harming their capacity to respond to customer requests, thus affecting their commercial efficiency.
The Solution: Collect Data, Index It, and Visualize It via Dashboards
To deal with these issues, the idea emerged of removing the data from the silos that subdivided it, and placing it all in a Hadoop data lake from which PDF reports could be extracted. But there was a problem: this solution called for additional developments and did not speed up the problem detection process enough. So, the Big Data team began looking for a solution that would enable it to have a clear view of the logs, in real time.
We took part in technical conferences to find a solution that would enable the restoration, analysis, and intuitive visualization of data in real time. The decision to use Elasticsearch was agreed across the board. We saw several advantages to it: the fact that it is a single platform rather than a collection of separate tools, that it can handle most usage scenarios, that it is scalable to the point that you simply need to roll out twice the infrastructure for it to double its capacity on its own, and, ultimately, that it is very simple to maintain.
In the first phase, the open-source version of Elasticsearch was installed in order to index and search data. For the visualization part, Kibana (part of the Elastic Stack) seemed an obvious choice. The benefits were immediately apparent: troubleshooting became so fast that the time to resolution was reduced from several hours to just a few minutes. At a technical level, the solution is simple to install and to maintain. It remains stable so long as good usage practices are adhered to.
Deployment: Allow Each Sub-Department to Find its Own Points of Interest
Oui.sncf is now equipped with a dedicated Elastic Stack cluster that comprises 20 servers, holds 80 TB of data, and ingests 2 TB of new information each day. This data, generally not kept for more than one month, is indexed by Elasticsearch with the aim of searching for points of interest.
The Elastic platform enables sub-departments to interact with events that are currently unfolding and to compare them to events from the preceding days in order to track their progression. At the same time, this data is stored in Hadoop for three years, for long-term Business Intelligence purposes. The analysis in Hadoop runs in batches, while Elasticsearch helps us do it in real time.
Since 2017, the architecture has been enriched with Apache Kafka, which allows peak loads to be absorbed and prevents any slowdowns in Oui.sncf's activity. Ingestion of the data itself is currently entrusted to Flume, an Apache Foundation open-source project. As Flume declines in popularity, it should soon be replaced with NiFi, its Apache successor. The architecture has been designed to facilitate predictive analysis and anomaly detection, with the latter made possible by the Elastic machine learning function available within X-Pack.
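The buffering role described above can be illustrated with a toy model. The sketch below is not Kafka code and does not reflect Oui.sncf's configuration; the function name `simulate_backlog`, the arrival figures, and the fixed indexing capacity are all hypothetical. It only shows how a queue placed in front of a fixed-capacity consumer lets a burst be held and drained later instead of slowing down the producers.

```python
def simulate_backlog(arrivals, capacity):
    """Return the queue backlog after each time step.

    arrivals: messages produced in each step (hypothetical numbers).
    capacity: messages the downstream indexer can process per step.
    """
    backlog = 0
    history = []
    for produced in arrivals:
        # Whatever cannot be processed in this step waits in the buffer.
        backlog = max(0, backlog + produced - capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # A burst in the second step is absorbed and drained over the next steps.
    print(simulate_backlog([5, 20, 5, 0], capacity=10))
```

In this model the producers are never blocked: the burst only shows up as a temporary backlog that shrinks once traffic returns to normal.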
Concerning visualization with Kibana, 200 users have been trained to create dashboards so that sub-departments can be fully autonomous in searching on the criteria that interest them. Dominique Debruyne says that creating Kibana dashboards based on filters and time criteria is, on the whole, quite simple. So simple, in fact, that it has given rise to best practices: for example, we advise users to restrict their searches to the name of their own project so as not to needlessly scan all the data, or to set an automatic refresh of every five minutes when they do not need to react to an event to the nearest second.
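The advice to restrict a search to one project and one time window can be sketched as an Elasticsearch query body. This is a minimal illustration, assuming hypothetical field names (`project`, `@timestamp`) and a hypothetical project name; the actual Oui.sncf log schema is not described in this case study. The `bool`, `filter`, `term`, and `range` clauses are standard Elasticsearch Query DSL.

```python
import json
from datetime import datetime, timedelta, timezone


def build_project_query(project, end, window_minutes=15):
    """Build a query body limited to one project and a recent time window.

    Field names are assumptions made for this sketch.
    """
    start = end - timedelta(minutes=window_minutes)
    return {
        "query": {
            "bool": {
                # Filter clauses narrow the search without scoring documents.
                "filter": [
                    {"term": {"project": project}},
                    {
                        "range": {
                            "@timestamp": {
                                "gte": start.isoformat(),
                                "lte": end.isoformat(),
                            }
                        }
                    },
                ]
            }
        }
    }


if __name__ == "__main__":
    now = datetime(2018, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(json.dumps(build_project_query("booking-api", now), indent=2))
```

Filtering on the project name first keeps the search on a small slice of the daily volume, which is the point of the best practice quoted above.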
Regarding the dashboards, the greatest effort doesn't take place in Kibana, but beforehand. We first needed to normalize the data: in other words, develop log templates that included all of the technical and departmental information we wanted to trace, so that our dashboards were based on coherent data that could easily be cross-checked. To do this, we worked with a dedicated team for a year to produce Java, PHP, and Python libraries for our application developers. These libraries produce normalized logs that follow a dozen templates, and the logs are then indexed by Elasticsearch. We are pleased to have taken this professional approach.
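As an illustration of what such a logging library might do, here is a minimal sketch of a helper that enforces one log template. The required field names and the function `normalized_log` are hypothetical; the case study does not describe the actual templates or library APIs.

```python
import json
from datetime import datetime, timezone

# Hypothetical template: every log line must carry these fields.
REQUIRED_FIELDS = ("project", "service", "event", "level")


def normalized_log(timestamp=None, **fields):
    """Return one JSON log line that conforms to the template."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing required log fields: {missing}")
    ts = timestamp or datetime.now(timezone.utc)
    record = {"@timestamp": ts.isoformat()}
    record.update(fields)
    # Sorted keys keep the output stable and easy to compare.
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(normalized_log(project="booking-api", service="payment",
                         event="order_created", level="INFO"))
```

Rejecting incomplete records at the source is one way to guarantee that every dashboard can filter and cross-check on the same fields.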
In fact, when new projects are developed, predefined models are used and sub-departments can create new dashboards for them very simply, without the Big Data team needing to intervene.
The Results: 50 Projects Supervised by 400 Dashboards and a Security System that Works on its Own
To date, Kibana is being used for more than 50 projects, through 400 dashboards handling 2 billion documents per day. Of these, 200 dashboards are used daily to check that service remains at the maximum level, to find areas for improvement, and to have as clear a picture as possible of activity.
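The volume figures quoted in this case study (2 TB ingested per day, 80 TB in total, 2 billion documents per day) can be related to each other with some back-of-the-envelope arithmetic. The sketch below assumes decimal terabytes and assumes that the three figures describe the same data stream; it ignores replicas and indexing overhead, so the results are rough orders of magnitude only.

```python
TB = 10**12  # decimal terabyte, an assumption for this estimate

DAILY_INGEST_BYTES = 2 * TB
TOTAL_BYTES = 80 * TB
DOCS_PER_DAY = 2_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def average_document_bytes():
    """Average size of one indexed document, in bytes."""
    return DAILY_INGEST_BYTES / DOCS_PER_DAY


def days_of_ingest_held():
    """How many days of ingest the total volume corresponds to."""
    return TOTAL_BYTES / DAILY_INGEST_BYTES


def average_docs_per_second():
    """Average indexing rate over a full day."""
    return DOCS_PER_DAY / SECONDS_PER_DAY


if __name__ == "__main__":
    print(average_document_bytes())          # bytes per document
    print(days_of_ingest_held())             # days of data
    print(round(average_docs_per_second()))  # documents per second
```

Under these assumptions, the average document is about 1 KB and the cluster holds roughly 40 days of ingest, which is broadly in line with the statement that data is generally kept for about a month.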
Oui.sncf installed wall screens for the display of Kibana dashboards within each of their services, enabling employees to continually follow the course of events of interest to them. This is a visual, color-coded check: if all indicators are green, all is well. If we see that the curves are starting to drift, we head to our workstations to open up interactive tables that will help us check for problems.
And since the insights provided by Kibana are making the company's business strategies more intelligent, the idea arose of extending the concept to artificial intelligence. For the time being, developments concern the cyber-security of Oui.sncf.
We use information indexed by Elasticsearch to, for example, identify any robots scanning our sites, in order to block them at the firewall level. Almost half of our web traffic comes from the activity of these robots. So, by cutting the number of visits in half, we have lightened the network load, which ultimately saves us money. In the same way, we are able to detect anomalies in our load balancers and automatically trigger preventive actions so that we do not succumb to denial-of-service attacks.
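One simple way to flag candidate robots from indexed access logs is to count requests per client within a time window and compare the count to a threshold. The sketch below is only an illustration of that idea, not the method used at Oui.sncf: the `client_ip` field, the threshold, and the function name are hypothetical, and blocking at the firewall is left out.

```python
from collections import Counter


def find_suspected_robots(events, max_requests_per_window):
    """Return client IPs whose request count exceeds the threshold.

    events: log records from a single time window, each with a
    "client_ip" key (a hypothetical field name).
    """
    counts = Counter(event["client_ip"] for event in events)
    return sorted(ip for ip, n in counts.items()
                  if n > max_requests_per_window)


if __name__ == "__main__":
    window = [{"client_ip": "203.0.113.7"}] * 5 + [{"client_ip": "198.51.100.2"}]
    print(find_suspected_robots(window, max_requests_per_window=3))
```

In practice, a fixed threshold is a crude signal; the machine learning features mentioned earlier are aimed at spotting anomalies without hand-tuned limits.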
Finally, on a more personal level, the Elastic Stack has enabled Dominique Debruyne to make his team more visible, turning it into a highly strategic entity.
The visibility the Elastic Stack gives us has formed an essential element of Oui.sncf's commercial success.
Oui.sncf's Clusters
Query Rate: 400,000 with batches
Hosting Environment: From assembly to production
Time-based Indices: Daily, weekly and monthly
Total Data Size: 80 TB
Node Specifications: 64 GB - 128 GB, local storage
Daily Ingest Rate: 2 TB
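Time-based indices are usually distinguished by a date suffix in the index name. The sketch below shows one possible naming scheme for the daily, weekly, and monthly granularities listed above. The prefix and the exact suffix formats are assumptions made for illustration; the naming convention actually used at Oui.sncf is not given.

```python
from datetime import date


def index_name(prefix, day, granularity):
    """Return a time-based index name for the given day.

    The suffix formats are hypothetical conventions for this sketch.
    """
    if granularity == "daily":
        return f"{prefix}-{day:%Y.%m.%d}"
    if granularity == "weekly":
        # ISO week numbering: weeks start on Monday.
        iso_year, iso_week, _ = day.isocalendar()
        return f"{prefix}-{iso_year}.w{iso_week:02d}"
    if granularity == "monthly":
        return f"{prefix}-{day:%Y.%m}"
    raise ValueError(f"unknown granularity: {granularity}")


if __name__ == "__main__":
    d = date(2018, 1, 15)
    for g in ("daily", "weekly", "monthly"):
        print(index_name("logs", d, g))
```

With one index per period, expiring old data amounts to deleting whole indices, which fits a retention policy of about one month.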