Brolly streamlined its entire incident management process, from logging and sorting to visualization and resolution, and improved how it prioritizes errors.
Brolly benefits from enhanced observability, including error triage, and can also replay error events and recover from failures.
Based in Melbourne, Australia, Brolly is a social media archiving service that enables governments and businesses to keep track of their activities on Facebook, Instagram, Twitter, YouTube, LinkedIn, and other platforms. This ability is especially critical in countries where laws require organizations to store and retrieve social media posts and conversations with customers or citizens.
To ensure that its customers can access these records, Brolly uses cloud-based servers and dozens of microservices that support its operations and enable customers, such as the City of Sydney and the Queensland Government, to access their social media history at any time.
As the company grew and its systems expanded, it became challenging to monitor and prioritize system and application errors to ensure customers have non-stop access to archived social media records. "There are so many logs that we have to collect, from the overall infrastructure, to how users are clicking within the platform, or how requests are made to different APIs," says Ali Nazemian, Chief Technology Officer, Brolly.
In such an advanced microservice environment, software errors are typical and even expected. "To a layperson that might sound dramatic," says Nazemian. "But not all errors are critical. The challenge is to collect system logs, sort and prioritize errors, and devote the appropriate resources to fixing them."
To keep track, the IT team used a combination of solutions ― including a fast-growing collection of customized dashboards ― for logging, sorting, and resolving issues. "It was an increasingly complex and difficult process," says Nazemian. "We were managing lots of tools and sifting through large amounts of data to figure out what was going on and what to prioritize."
"This inability to contextualize and correlate logging errors at speed was expensive, took up a large amount of employee time, and threatened to compromise customer service. It was clear that we needed a more effective solution for the incident management team," says Nazemian.
After evaluating several solutions, Brolly chose Elastic Cloud to run nearly all its incident management processes, including search, observability, and security across its infrastructure. Elastic collects and centralizes logs across all of Brolly's systems, keeps track of failures in the data pipeline, and speeds up data recovery.
Brolly uses Elastic Filebeat, Metricbeat, and APM (Application Performance Monitoring) to collect information about system errors and data quality issues. "With Beats, it's much easier to get information together in one place, whether it's from your microservices or infrastructure," says Salman Ahmed, Solutions Architect, Brolly.
All the data then goes into Elasticsearch for indexing while Kibana is used to visualize and present data. "We no longer need to go to every service individually to look into the logs of those services. We now have them all in the same place, which saves a huge amount of time," says Ahmed.
The Elastic Stack enables us to manage incidents and errors across the entire lifecycle, from logging and visualization, to replay and resolution.
Ahmed also stresses how Elastic enables the incident management team to contextualize logs and correlate them with actions and incidents across all its microservices. "Being able to create this context view within Elastic is really powerful and gives us the ability to log and then track the whole lineage of a social media record," says Ahmed.
Using APM, the Brolly team can track the progress of a record across different microservices. If the record fails, they can see which services it has passed through and determine where it has been processed properly, and where there may be an error.
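The lineage check described above can be illustrated with a minimal sketch. It does not call Elastic APM; it only models the idea of ordering a record's trace events and finding the first service that reported a failure. The `TraceEvent` fields, the service names, and the record IDs are all hypothetical and are not taken from Brolly's systems.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TraceEvent:
    """One hop of a record through a microservice (hypothetical schema)."""
    record_id: str
    service: str
    timestamp: float
    outcome: str  # "success" or "failure"


def lineage(events: List[TraceEvent], record_id: str) -> List[Tuple[str, str]]:
    """Return (service, outcome) pairs for one record, ordered by time."""
    hops = sorted(
        (e for e in events if e.record_id == record_id),
        key=lambda e: e.timestamp,
    )
    return [(e.service, e.outcome) for e in hops]


def first_failure(events: List[TraceEvent], record_id: str) -> Optional[str]:
    """Return the first service that failed for this record, or None."""
    for service, outcome in lineage(events, record_id):
        if outcome == "failure":
            return service
    return None


# Sample events, deliberately out of order, as they might arrive from several services.
EVENTS = [
    TraceEvent("rec-42", "archive", 3.0, "failure"),
    TraceEvent("rec-42", "ingest", 1.0, "success"),
    TraceEvent("rec-7", "ingest", 1.5, "success"),
    TraceEvent("rec-42", "enrich", 2.0, "success"),
]

if __name__ == "__main__":
    print(lineage(EVENTS, "rec-42"))
    print(first_failure(EVENTS, "rec-42"))
```

In this sketch, the services listed before the failing one are those that processed the record properly, which mirrors the kind of question the team answers with APM traces.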
In the past, it might have taken us a couple of days to fix an error. With Elastic, it is so much faster than before. It now typically takes only 30 minutes to find the issue, resolve it, and then track the fix.
With the deployment of Kibana, Brolly can better detect, prioritize, and report incidents and anomalies. "With Kibana we can do anything from tracking query load to understanding the way requests flow through our apps. It brings everything together in a clear, simple display which doesn't compromise the level of detail required by our engineers," says Ahmed.
The Brolly team is also making extensive use of Elasticsearch watchers. These are configured to trigger actions based on a specified condition. "Watchers are really helpful for analysing mission-critical streaming data," says Omid Mirzaei, Software Engineer, Brolly. "I can create a watcher that alerts us with an email or a Slack message when a data clean-up fails. That's just one example, the possibilities are endless."
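As a rough illustration of what such a watcher might look like, the sketch below builds a watch definition as a Python dictionary, following the general trigger / input / condition / actions layout of the Elasticsearch Watcher API. It only constructs and prints the JSON body and does not contact a cluster. The index pattern, the `job.name` and `event.outcome` fields, the email address, and the Slack channel are assumptions made for the example, not Brolly's actual configuration.

```python
import json


def build_cleanup_failure_watch(index_pattern: str, email_to: str,
                                slack_channel: str, interval: str = "5m") -> dict:
    """Build a watch body that alerts when a data clean-up job logs a failure.

    All index and field names here are hypothetical placeholders.
    """
    return {
        # Run the check on a fixed schedule.
        "trigger": {"schedule": {"interval": interval}},
        # Search recent log documents for failed clean-up runs.
        "input": {
            "search": {
                "request": {
                    "indices": [index_pattern],
                    "body": {
                        "size": 0,
                        "query": {
                            "bool": {
                                "filter": [
                                    {"term": {"job.name": "data-cleanup"}},
                                    {"term": {"event.outcome": "failure"}},
                                    {"range": {"@timestamp": {"gte": f"now-{interval}"}}},
                                ]
                            }
                        },
                    },
                }
            }
        },
        # Fire the actions only if at least one failure was found.
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": 0}}},
        "actions": {
            "notify_email": {
                "email": {
                    "to": [email_to],
                    "subject": "Data clean-up failed",
                    "body": "{{ctx.payload.hits.total}} clean-up failure(s) detected.",
                }
            },
            "notify_slack": {
                "slack": {
                    "message": {
                        "to": [slack_channel],
                        "text": "Data clean-up failed: {{ctx.payload.hits.total}} failure(s).",
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    watch = build_cleanup_failure_watch("logs-*", "oncall@example.com", "#incidents")
    print(json.dumps(watch, indent=2))
```

A body like this would typically be registered with the Watcher API under a watch ID; the email and Slack actions also depend on notification accounts being configured in the cluster, which is outside the scope of this sketch.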
"This is where the power of Elastic really comes into play," says Nazemian. "Across our entire platform, we can monitor and be alerted about various events, such as a user connecting a new account or generating an export."
Looking to the future, Nazemian is confident that the Brolly team will continue to save time and money, enabling engineers to focus on enhancing the customer experience, rather than spending days trawling through data logs and complex spreadsheets. "Since we started using Elastic, we've seen important enhancements to their technology. We now have complete confidence that we can continue to scale to meet the needs of our customers as they expand their use of social media," he says.