Cloud consultancy reduces client outages from four times a day to just once a quarter with Elastic-based cloud automation solution

Reduces outages from four per day to just one per quarter

With Elastic at the heart of its orchestration and automation solution, Mentat helped a major government client reduce daily outages to just one per quarter.

Scales to meet the growth ambitions of multiple clients

Mentat can adapt and scale its solution to multiple telemetry and automation use cases across its client base with Elastic.

Enables clients to focus on business building rather than triage

With Elastic, Mentat clients spend more time proactively supporting their businesses instead of reacting to unexpected disruptions.

Mentat deploys Elastic Enterprise Search and Observability at the heart of its telemetry and automation solution, enabling a major government client to boost cloud performance and pivot resources to the delivery of critical public-facing services.

The phrase ‘cloud computing’ sounds effortless and uncomplicated but, in reality, these advanced technologies can be as complex as their on-premises equivalents. This is where Mentat comes in. Founded in 2015, Mentat is a cloud-agnostic consultancy that helps organizations simplify, automate, and orchestrate their cloud computing architecture. This includes managed migrations and deployments of web-scale architecture in private and public cloud settings.

More efficient cloud infrastructure delivers measurable business benefits. Jonathan Doughty, Founder and CEO at Mentat, says, “Automation enables faster time to market, increased customer satisfaction, and higher employee productivity while helping companies pivot resources toward new product or service features.”

Over the years, Mentat has settled on a core set of automation and development tools for most of its engagements. Elastic Observability, including Elasticsearch as a storage and search engine, Kibana to analyze and visualize data, and Logstash to parse logs, is a main component.

Mentat also integrates Terraform, Ansible, GitHub, and GitLab with its proprietary app. It spins up its clients inside Azure DevOps, which it treats as the orchestration layer.

"Our clients can choose from thousands of data pipelines and hundreds of Terraform and Ansible modules. Once they are up and running, we can look at customizations that further improve their cloud performance," says Doughty.

Elastic is at the heart of our architecture, acting as a holding zone for data that we need to interface or transmit between different automation systems. It’s the glue that holds all of them together.

– Jonathan Doughty, Founder and CEO, Mentat

Workflow automation drives business benefits

Mentat recently put its network and automation telemetry solutions to good use for a major U.S. city government's IT organization. Its infrastructure includes multiple data centers, tens of thousands of endpoints, and millions of users.

Before the engagement with Mentat, the organization suffered data center outages about four times a day. There was no one with the time or expertise to uncover the underlying problem. "People were constantly on the phone, whether it was troubleshooting latency issues or packet drops, or servers being offline. They were constantly fighting fires," says Doughty.

"Most city agencies have horror stories where their services are offline or unavailable for periods of time," says Doughty. "With our solution, we can identify the source of the problem and deploy a long-term fix that reduces outages to an absolute minimum," Mentat gathered data from thousands of network devices to identify the problems causing the outages. Elastic Observability was at the heart of the solution, which also included Kafka to act as a buffer and Ansible as the data collector and data manipulator.

"Elastic configurations, like automated backups on a schedule for AWS S3 storage, were extremely helpful," says Doughty. "This gave us peace of mind that the data was always available."

Elastic also enabled Mentat to set up rigorous index lifecycle management (ILM) policies. "We move data out of our hot tier after two or three days. It resides in a warm tier for about 10 days, and then we just physically move it off our platform," says Doughty.
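As a rough illustration of the tiering Doughty describes, the sketch below builds an ILM policy body in Python. It is not Mentat's actual policy: the function name, the three-day and ten-day defaults, and the priority values are assumptions, and the "move it off our platform" step is modeled here as a plain delete phase because the source does not say where the data goes.

```python
import json


def build_ilm_policy(warm_after_days=3, warm_days=10):
    """Build an ILM policy body with hot, warm and delete phases.

    Assumptions (not from the case study): data leaves the hot tier after
    `warm_after_days`, stays warm for `warm_days`, and is then deleted.
    """
    delete_after_days = warm_after_days + warm_days
    return {
        "policy": {
            "phases": {
                # New indices start in the hot phase immediately.
                "hot": {
                    "min_age": "0ms",
                    "actions": {"set_priority": {"priority": 100}},
                },
                # After two or three days the index moves to the warm phase.
                "warm": {
                    "min_age": f"{warm_after_days}d",
                    "actions": {"set_priority": {"priority": 50}},
                },
                # Roughly ten days later the index is removed from the cluster.
                "delete": {
                    "min_age": f"{delete_after_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```

A body like this would typically be sent to the cluster's ILM policy endpoint under a policy name of the operator's choosing. If the data must be retained elsewhere, a snapshot would need to be taken before the delete phase runs.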

Once all the pieces were in place, the client asked Doughty and the Mentat team to look at cyclic redundancy check (CRC) metrics and pointed them towards the network switches they suspected to be sources of the outages.

From four outages a day to just one per quarter

An initial measurement of five switches showed that one had a physical defect. When Mentat scanned the entire network, it discovered that hundreds of devices were generating CRC errors. "Once the client worked through all their physical devices and replaced or updated their cabling, they went from four outages per day to one outage per quarter," says Doughty.
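The network-wide scan can be pictured as a comparison of CRC error counters over time. The following is a minimal sketch, assuming each interface reports a cumulative CRC error count at regular intervals. The data layout, the function name and the threshold are hypothetical and are not taken from Mentat's implementation.

```python
def find_crc_offenders(samples, min_increase=1):
    """Return interfaces whose CRC error counter grew, worst first.

    `samples` maps a (device, interface) pair to a list of cumulative CRC
    error counts in time order. Counter resets are not handled here; a
    reset would show up as a negative increase and be ignored.
    """
    offenders = {}
    for key, counts in samples.items():
        if len(counts) < 2:
            continue  # Need at least two readings to see a change.
        increase = counts[-1] - counts[0]
        if increase >= min_increase:
            offenders[key] = increase
    # Sort so the interfaces with the most new errors come first.
    return sorted(offenders.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    readings = {
        ("switch-1", "Gi0/1"): [10, 10, 10],   # stable counter, likely healthy
        ("switch-2", "Gi0/3"): [5, 40, 120],   # steadily climbing errors
        ("switch-3", "Gi0/2"): [0, 2],         # small but non-zero growth
    }
    for (device, interface), increase in find_crc_offenders(readings):
        print(f"{device} {interface}: +{increase} CRC errors")
```

A steadily rising counter on a single port usually points to a physical-layer fault such as a damaged cable or transceiver, which is consistent with the cabling fixes described above.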

The error detection system was so successful that it has since been extensively deployed across the wider client organization. For this rollout, Mentat wrote a customized task for Azure DevOps that plugs into every pipeline across its automation platform and ships pipeline data to its Elasticsearch cluster. The client can then visualize failure rates across different task types and segments of the pipelines.
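The failure-rate view can be approximated with a simple aggregation over task results. This sketch assumes each shipped event carries a task type and a result string. The field names `task_type` and `result` and the value `succeeded` are illustrative assumptions, not the schema Mentat uses.

```python
from collections import defaultdict


def failure_rates(events):
    """Compute the share of failed runs per task type.

    Each event is a dict with hypothetical keys "task_type" and "result".
    Any result other than "succeeded" is counted as a failure.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for event in events:
        task_type = event["task_type"]
        totals[task_type] += 1
        if event["result"] != "succeeded":
            failures[task_type] += 1
    return {task: failures[task] / totals[task] for task in totals}


if __name__ == "__main__":
    sample_events = [
        {"task_type": "terraform_apply", "result": "succeeded"},
        {"task_type": "terraform_apply", "result": "failed"},
        {"task_type": "ansible_playbook", "result": "succeeded"},
    ]
    for task, rate in failure_rates(sample_events).items():
        print(f"{task}: {rate:.0%} failed")
```

In practice the same grouping would more likely be done by an aggregation in Elasticsearch and charted in Kibana. The Python version only shows the calculation behind such a chart.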

The next step is to add event-driven, self-healing automation driven by triggers in Elastic Observability. "We set up an initial trigger to view certain cloud data and send an automated email to the cloud team who can address the issue. Now, we want to set up triggers on our platform to automatically perform a specific action and accelerate time to resolution," says Doughty.

"Elastic enables us to create simple, versatile, automated solutions that can easily adapt to many client use cases. It's good for our clients and good for the growth of our business," says Doughty. Mentat also makes extensive use of Kibana dashboards to visualize data over time and in multiple ways. "Whether it's a pie chart, line chart, or even to tag cloud entities, Kibana helps clients understand and fix problems more efficiently," says Doughty.

Mentat has identified about 10 different use cases for the self-healing, automated workflow that can benefit its government client in the future. For instance, scenarios can be baked into Elasticsearch triggers that can fire and execute resolution pipelines if a web server goes down or a database needs to be recovered.
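One way to picture this kind of self-healing workflow is a small dispatcher that maps an alert type to a resolution action. The sketch below is purely illustrative: the alert types, the handler names and the returned messages are assumptions, and the handlers return strings where a real system would start a resolution pipeline.

```python
def restart_web_server(alert):
    """Hypothetical action: queue a pipeline that restarts the web server."""
    return f"restart pipeline queued for {alert['host']}"


def recover_database(alert):
    """Hypothetical action: queue a pipeline that recovers the database."""
    return f"recovery pipeline queued for {alert['host']}"


# Map each known alert type to the action that resolves it.
HANDLERS = {
    "web_server_down": restart_web_server,
    "database_recovery_needed": recover_database,
}


def dispatch(alert):
    """Run the matching action, or fall back to notifying the cloud team."""
    handler = HANDLERS.get(alert.get("type"))
    if handler is None:
        return f"no automated action for {alert.get('type')}; notify cloud team"
    return handler(alert)


if __name__ == "__main__":
    print(dispatch({"type": "web_server_down", "host": "web-01"}))
    print(dispatch({"type": "disk_full", "host": "db-02"}))
```

Keeping the fallback path as a notification mirrors the email trigger already in place: anything without a known automated fix still reaches the cloud team.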

I like everything about Elastic. Its engineers are easy to work with and there's always great support. The tool itself fits our vision exactly. It is easy, simple, secure, and it scales. It is also future proof thanks to frequent new releases and enhancements that we can apply to existing Elastic deployments.

– Jonathan Doughty, Founder and CEO, Mentat