Like a lot of companies, Paddy Power invested heavily in log aggregation and analysis, which quickly became a critical tool, both operationally and for the development teams. Splunk became the go-to place to find answers to many questions about performance and what was going on in our systems.
Paddy Power sees double-digit growth year on year, so not surprisingly we rather quickly hit the daily license limit on what we were indexing in Splunk. We got a temporary increase while a budget request was submitted for a permanent license increase.
That request was rejected, so I was set a challenge: build an alternative to Splunk that not only provided the functionality we were so used to, but did so for a fraction of the cost (thankfully the fraction was not specified!), and have production logs in this new system within six weeks. With Christmas approaching, that was really more like four weeks.
My team had previously investigated other log analysis platforms, building and evaluating the various tools, and had concluded that the only product that would meet our needs was the Elastic Stack. This meant we could spend our limited time building a solution with the Elastic Stack straight away, without another POC phase.
We knew we could set up an Elastic Stack ourselves, but we also knew it would not be fit for Paddy Power's purposes and would not have a chance of scaling, so we needed help from the experts. I decided to go directly to the source and contacted Elastic.
We had a few calls to discuss what we wanted as a solution and signed on the dotted line for a 10-node cluster, with the goal of scaling it as we migrated more and more to the Elastic Stack. We also took an eight-day consultancy package with the aim of having our cluster production-ready by the end of it.
I had one full-time employee dedicated to working with the Elastic consultant and an extra person available to help out when needed. Even though we ran into some challenges in the build phase, at the end of those eight days we were indexing production logs for the specified application in parallel in both the Elastic Stack and Splunk. We confirmed that searches were returning the same results in both stacks and replicated the dashboards the users were so fond of. The build can only be described as manic: we helped Elastic rewrite Puppet modules, found bugs in the latest Filebeat, and were running beta versions of pretty much every component of the stack. But it was a hugely entertaining and rewarding two weeks.
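The parallel-indexing approach described above amounts to pointing a lightweight shipper at the same log files the existing Splunk forwarder is already reading. As a rough sketch (the paths, field names, and hostnames here are illustrative, not Paddy Power's actual configuration), a minimal Filebeat config for that looks something like:

```yaml
# filebeat.yml — illustrative sketch only; paths and hosts are assumptions.
# Filebeat tails the same files the Splunk forwarder reads, so both
# systems receive identical events during the parallel-run phase.
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log        # hypothetical application log path
    fields:
      app: myapp                    # tag events for filtering downstream

output.logstash:
  hosts: ["logstash.internal:5044"] # hypothetical Logstash endpoint
```

Because both pipelines consume the same source files, search results and dashboards can be compared side by side before decommissioning anything.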
Due to the tight timelines, we didn't have time to procure dedicated hardware, so we plonked the entire stack on virtual machines. We knew there would be performance issues with this, but we ended up seeing massive lag during our peak times. We engaged with Elastic and found that running on virtual machines was not the only contributing factor: we identified performance bottlenecks in our implementation of Logstash and various other components of the Elastic Stack. I have never been so impressed with Elastic support as I was when we had calls with the original authors of Filebeat, Logstash, and other Elastic Stack components. We had new agents custom written, and these improvements were merged back to the master branch, so hopefully people reading this are enjoying those improvements and bug fixes!
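The post doesn't detail which Logstash bottlenecks were found, but tuning of this kind typically revolves around the pipeline settings in `logstash.yml`. A hedged sketch, with purely illustrative values:

```yaml
# logstash.yml — illustrative tuning sketch, not the actual values used.
pipeline.workers: 8       # parallel filter/output workers; size to CPU cores
pipeline.batch.size: 250  # events per worker batch; larger batches cut per-event overhead
pipeline.batch.delay: 50  # ms to wait before flushing an underfilled batch
```

On virtualised hosts with contended CPU, settings like these interact heavily with the underlying hardware, which is one reason VM performance can diverge so sharply from bare metal.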
Following that migration, the second phase of the project was dedicated to moving onto proper hardware and migrating further applications into the Elastic Stack. Since our merger with Betfair, we are now looking into the Elastic Stack as a solution for the entire combined organisation. I can't give a higher compliment than that!
Kevin Moore is a Systems Engineering Manager at PaddyPower Betfair. I have been with Paddy Power for 5 years, having started as a Senior Linux Admin. My team is responsible for the almost 8,000 *nix systems that run the Paddy Power website and mobile apps, as well as internal apps and tools. We help the business deliver stable, reliable, and scalable features at pace, and we are responsible for a plethora of the usual tools and applications you would expect to find in an engineering/DevOps team.