Anthony Evans from StackState wrote this blog post. Anthony is a Solutions Engineer with StackState and a lifelong technologist. He has spent the last 15 years helping companies advance their software delivery capabilities while maintaining reliability, working extensively to support customers in the SaaS, AI, and Service Management landscapes.
Delivering great performance and reliability in our ever-changing, increasingly autonomous IT environments is fundamentally a data problem. Sure, there’s lots of data (logs, metrics, and APM traces), but it is exceedingly hard to extract actionable information when there are so many fast-moving parts.
If you’re already using Elastic Observability, you know the major advantage of bringing your monitoring data together into one unified view. When you add StackState on top of Elastic’s strong data foundation, you enrich it with a deep, real-time view of topology and its constant changes. This is the contextual layer that is key for pinpointing the root cause of most performance problems and preventing them from occurring in the future. Read on to see the magic that happens when Elastic and StackState team up.
Before the relationships between components became so complex, you could practically pinpoint a problem off the top of your head. If your shopping cart service wasn't working, you could point to a particular set of machines, physically log into them, and figure it out. Everything sat in one place, so it was simple. You could build dashboards a priori, and they would be a great help in times of need.
Nowadays, your systems are continuously changing. If you get an error or alert that suddenly says, “I cannot put stuff in my shopping basket anymore,” there could be a thousand reasons why. And even though you have a lot of data, it's overwhelming rather than helpful because (1) there is far too much of it to sift through and (2) you lack the context to act on it. It becomes very difficult, if not impossible, to manually find a root cause in this ever-changing, super complex environment. This is why you need the ability to automate root cause analysis.
The importance of a time-traveling topology for root cause analysis
Without a time-traveling topology, however, root cause analysis is a lot like searching for a needle in a haystack. When a problem occurs, all you can do is work through a set of hypotheses and start checking dozens of different logs from dozens of different services. That is a lot of work.
Instead, a good place to start is by asking, “What changed?” For example, your cloud provider’s autoscaler spun up a new machine, a number of Kubernetes pods started to run on that machine as part of a set of services, and all of that coincided with a configuration change in a Kubernetes deployment. Let’s say you take all that change data and you stick it into Elasticsearch. Elasticsearch ingests, stores, and analyzes all this data at scale. Then, Elastic Observability puts all that monitoring data into one unified view. You now have a powerful observability data foundation, but there’s one thing missing: you need to know how the pieces of your data relate to one another.
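To make the “What changed?” question concrete, here is a minimal sketch in plain Python. The event records, their field names, and the 30-minute lookback are all assumptions made up for illustration; they are not an Elastic or StackState data format.

```python
from datetime import datetime, timedelta

# Hypothetical change events, shaped as they might look once collected.
CHANGES = [
    {"time": datetime(2021, 9, 1, 9, 2), "source": "autoscaler", "summary": "new machine node-7 added"},
    {"time": datetime(2021, 9, 1, 9, 4), "source": "kubernetes", "summary": "pods scheduled on node-7"},
    {"time": datetime(2021, 9, 1, 9, 5), "source": "kubernetes", "summary": "deployment cart config changed"},
    {"time": datetime(2021, 9, 1, 6, 30), "source": "kubernetes", "summary": "deployment search scaled to 3"},
]


def what_changed(changes, incident_time, lookback=timedelta(minutes=30)):
    """Return the changes inside the lookback window before the incident, oldest first."""
    start = incident_time - lookback
    recent = [c for c in changes if start <= c["time"] <= incident_time]
    return sorted(recent, key=lambda c: c["time"])


incident = datetime(2021, 9, 1, 9, 10)
for change in what_changed(CHANGES, incident):
    print(change["time"].isoformat(), change["source"], change["summary"])
```

A time filter like this narrows the list, but it cannot tell you which of the remaining changes actually touch the failing service. That is the gap relationship data has to fill.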
Tracking changes and dependencies
That is exactly where the magic happens between Elastic and StackState. The tracked changes in StackState’s time-traveling topology make it trivial to see the correlation between the change caused by the autoscaler and the change in the Kubernetes deployment, and the causality between the two. You now know that this machine is part of a production Kubernetes cluster running in a certain cloud region. Kubernetes started running pods that all belong to a single deployment, and the topology is the net that ties all those things together.
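The idea of a topology you can rewind is easier to picture with a toy model. The sketch below is not StackState’s implementation. It is a small, hypothetical Python class (`TimeTravelTopology`, with made-up component names and integer timestamps) that records relation changes and replays them to answer “what did the topology look like at time T?”

```python
from bisect import bisect_right


class TimeTravelTopology:
    """Toy topology store: records relation changes and answers 'as of' queries."""

    def __init__(self):
        self._times = []   # sorted timestamps of recorded changes
        self._events = []  # (action, source, target), parallel to _times

    def record(self, timestamp, action, source, target):
        # This sketch assumes changes arrive in chronological order.
        if self._times and timestamp < self._times[-1]:
            raise ValueError("changes must be recorded in time order")
        self._times.append(timestamp)
        self._events.append((action, source, target))

    def relations_at(self, timestamp):
        """Replay every change up to and including the given timestamp."""
        relations = set()
        upto = bisect_right(self._times, timestamp)
        for action, source, target in self._events[:upto]:
            if action == "add":
                relations.add((source, target))
            elif action == "remove":
                relations.discard((source, target))
        return relations

    def diff(self, before, after):
        """Return (added, removed) relations between two points in time."""
        old, new = self.relations_at(before), self.relations_at(after)
        return new - old, old - new


topology = TimeTravelTopology()
topology.record(100, "add", "deployment/cart", "pod/cart-1")
topology.record(100, "add", "pod/cart-1", "node/node-3")
topology.record(200, "add", "cluster/prod", "node/node-7")  # autoscaler adds a machine
topology.record(210, "add", "deployment/cart", "pod/cart-2")
topology.record(210, "add", "pod/cart-2", "node/node-7")

added, removed = topology.diff(150, 220)
print(sorted(added))
print(sorted(removed))
```

Because every relation carries its place in time, the diff shows the new machine and the new pod of the cart deployment side by side, which is the kind of connection a flat list of events does not reveal.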
Sounds good, doesn’t it? StackState pinpoints the root cause of change-induced problems in highly dynamic environments, and Elastic Observability then gives you all the relevant in-depth details. However, there’s one more challenge: doing this typically requires a lot of data, especially if you want to keep this data around for several years to build enough context for Elastic features like machine learning to work on and produce the most accurate results. That data has gravity and there is a lot of physics involved.
No need to copy Elastic Observability data to StackState
Luckily, there’s a solution for that: StackState can work with the existing data in Elastic without requiring you to copy it over to StackState. This is important because Elastic lets you make savvy decisions about data storage through the concept of data tiers: a hybrid storage approach where hot data can be kept on high-performance disks, and warm data can be kept on lower-cost disks.
More recently, Elastic also allows you to tune your data compute and storage utilization using the frozen tier. This means you can decouple storage from compute and store years of data without going crazy in terms of infrastructure costs. StackState will only store the time-traveling topology and will leverage Elastic data as if it were all part of the same platform. Retention configurations of StackState’s time-traveling topology are also entirely decoupled from Elastic’s observability data.
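As a rough illustration of tiering, the sketch below builds an index lifecycle management (ILM) policy body as a Python dictionary, in the shape Elasticsearch accepts for an ILM policy. The ages and the repository name are placeholder assumptions, not recommendations, and the frozen phase assumes a snapshot repository has already been registered.

```python
import json

# Hypothetical retention choices; tune these to your own cost and query needs.
HOT_ROLLOVER_AGE = "7d"
WARM_AFTER = "30d"
FROZEN_AFTER = "90d"
SNAPSHOT_REPOSITORY = "my-snapshot-repo"  # placeholder repository name

ilm_policy = {
    "policy": {
        "phases": {
            # Hot: newest data on the fastest disks, rolled over weekly.
            "hot": {"actions": {"rollover": {"max_age": HOT_ROLLOVER_AGE}}},
            # Warm: older data moves to cheaper hardware with lower recovery priority.
            "warm": {"min_age": WARM_AFTER, "actions": {"set_priority": {"priority": 50}}},
            # Frozen: data is served from a searchable snapshot in object storage.
            "frozen": {
                "min_age": FROZEN_AFTER,
                "actions": {"searchable_snapshot": {"snapshot_repository": SNAPSHOT_REPOSITORY}},
            },
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```

The printed JSON is what you would send as the body of a `PUT _ilm/policy/<policy-name>` request. Nothing in it concerns StackState, which is the point: retention of the observability data stays an Elastic-side decision.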
So, if a customer says, “I have a few exabytes of log data from my cloud-native applications in Elastic Cloud”, then what StackState will say is, “Okay, that's fine. Keep all of that.” When you are investigating an issue, you can use StackState’s time-traveling topology to do so. And when you need the relevant metrics or logs to investigate further, you can access them through StackState, which analyzes the data that comes directly from Elastic.
But that’s not all: StackState can use existing data in Elasticsearch to extend the topology in StackState. All you have to do is let StackState read the topological data from Elasticsearch and it will automatically build a time-traveling topology. Consider how much more value you can get out of data from Beats such as osquery and Packetbeat sitting in your Elasticsearch cluster!
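To give a feel for how relationship data can fall out of documents you already store, here is a minimal sketch that collapses network flow records into directed edges. The sample documents are invented and only loosely follow the ECS-style `source` and `destination` fields. How StackState actually reads and interprets Elasticsearch data is not shown here.

```python
from collections import defaultdict

# Hypothetical documents shaped loosely like network flow records.
# Real documents carry many more fields.
FLOW_DOCS = [
    {"source": {"ip": "10.0.0.5"}, "destination": {"ip": "10.0.0.9", "port": 5432}},
    {"source": {"ip": "10.0.0.5"}, "destination": {"ip": "10.0.0.9", "port": 5432}},
    {"source": {"ip": "10.0.0.7"}, "destination": {"ip": "10.0.0.5", "port": 8080}},
]


def build_topology(docs):
    """Collapse flow documents into directed edges with an observation count."""
    edges = defaultdict(int)
    for doc in docs:
        src = doc["source"]["ip"]
        dst = f'{doc["destination"]["ip"]}:{doc["destination"]["port"]}'
        edges[(src, dst)] += 1
    return dict(edges)


for (src, dst), count in sorted(build_topology(FLOW_DOCS).items()):
    print(f"{src} -> {dst} (seen {count}x)")
```

Each edge says “this host talks to that service,” which is the raw material a topology needs.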
With the two running side by side, customers get all the benefits and capabilities of Elastic’s powerful observability infrastructure with the added bonus of StackState’s time-traveling topology that says, “Ding! Ding! Ding! Here’s where your problem happened!”
To conclude, here’s where the magic happens between Elastic and StackState:
- Ingest all your telemetry data into Elastic and use its very awesome cloud infrastructure.
- Use StackState’s real-time discovery to build a topology, track all changes and contextualize Elastic Observability data.
- Quickly pinpoint, dive deep into, resolve, and ultimately prevent problems in dynamic cloud-native environments.
Pretty neat, right? If you want to understand a bit more about how StackState can help you to quickly identify the root cause of service infrastructure issues, join us on Sep 23, when Anthony Evans, Solutions Engineer from StackState, will demonstrate the product live and answer any questions you may have.
Want to learn more? Reach out to StackState directly to schedule a personal demo. Or, if you want to see more of how StackState works on your own, explore the hands-on playground or request a free trial. We’re looking forward to helping you make the most of your data.