Mentat recently put its network and automation telemetry solutions to good use for a major U.S. city government's IT organization. The city's infrastructure includes multiple data centers, tens of thousands of endpoints, and millions of users.
Before the engagement with Mentat, the organization suffered data center outages about four times a day. There was no one with the time or expertise to uncover the underlying problem. "People were constantly on the phone, whether it was troubleshooting latency issues or packet drops, or servers being offline. They were constantly fighting fires," says Doughty.
"Most city agencies have horror stories where their services are offline or unavailable for periods of time," says Doughty. "With our solution, we can identify the source of the problem and deploy a long-term fix that reduces outages to an absolute minimum." Mentat gathered data from thousands of network devices to identify the problems causing the outages. Elastic Observability was at the heart of the solution, which also included Kafka as a buffer and Ansible to collect and transform the data.
"Elastic configurations, like automated backups on a schedule for AWS S3 storage, were extremely helpful," says Doughty. "This gave us peace of mind that the data was always available."
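As a rough illustration of what scheduled backups to S3 can look like, the sketch below builds the request bodies for an Elasticsearch S3 snapshot repository and a snapshot lifecycle management (SLM) policy. The article does not describe Mentat's actual configuration, so the repository name, bucket name, schedule, and retention values here are all assumptions chosen for the example.

```python
import json


def build_s3_repository(bucket: str) -> dict:
    """Body for `PUT _snapshot/<repository>` registering an S3 repository.

    The bucket name is a placeholder, not a value from the article.
    """
    return {"type": "s3", "settings": {"bucket": bucket}}


def build_slm_policy(repository: str, expire_after_days: int = 30) -> dict:
    """Body for `PUT _slm/policy/<policy_id>`.

    The schedule (01:30 every day) and retention window are illustrative
    assumptions; the article only says backups ran on a schedule.
    """
    return {
        "schedule": "0 30 1 * * ?",        # cron: daily at 01:30
        "name": "<nightly-snap-{now/d}>",  # date-math snapshot name
        "repository": repository,
        "config": {"indices": ["*"], "include_global_state": False},
        "retention": {"expire_after": f"{expire_after_days}d"},
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_repository("example-telemetry-backups"), indent=2))
    print(json.dumps(build_slm_policy("s3_backups"), indent=2))
```

Both dictionaries are plain JSON bodies, so they could be sent with any HTTP client or pasted into Kibana Dev Tools. Registering an S3 repository also requires AWS credentials to be available to the cluster, which this sketch does not cover.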
Elastic also enabled Mentat to set up rigorous index lifecycle management (ILM) policies. "We move data out of our hot tier after two or three days. It resides in a warm tier for about 10 days, and then we just physically move it off our platform," says Doughty.
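The retention schedule Doughty describes could be expressed as an Elasticsearch ILM policy along the following lines. This is a minimal sketch, not Mentat's actual policy: it assumes three days in the hot tier, ten in warm, and models "move it off our platform" as a delete phase, although the data may equally have been archived elsewhere first. The function name and the priority values are illustrative.

```python
import json


def build_ilm_policy(hot_days: int = 3, warm_days: int = 10) -> dict:
    """Body for `PUT _ilm/policy/<policy_name>`.

    Each phase's min_age is measured from index creation (or from
    rollover, if rollover is used), so the delete phase starts after
    hot_days + warm_days in total.
    """
    delete_after = hot_days + warm_days
    return {
        "policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    # Assumed: recover hot indices first after a restart.
                    "actions": {"set_priority": {"priority": 100}},
                },
                "warm": {
                    "min_age": f"{hot_days}d",
                    "actions": {"set_priority": {"priority": 50}},
                },
                "delete": {
                    "min_age": f"{delete_after}d",
                    # Assumed: removal is modelled as an ILM delete action.
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```

With the defaults, an index moves to the warm phase at 3 days old and is deleted at 13 days old, matching the "two or three days" plus "about 10 days" in the quote.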
Once all the pieces were in place, the client asked Doughty and the Mentat team to look at cyclic redundancy check (CRC) metrics and pointed them towards the network switches they suspected to be sources of the outages.