Beyond The Math: Effective Machine Learning In Security

Editor’s Note: Elastic joined forces with Endgame in October 2019 and has since migrated some of the Endgame blog content. See Elastic Security to learn more about our integrated security solutions.

In an attempt to appeal to information security executives and practitioners, some vendors have positioned Machine Learning (ML) – often liberally decorated as “Artificial Intelligence” (AI) – as a panacea for information security’s challenges. In many cases, the hype has gone well beyond reality and become marketing nonsense. Is there a useful place in information security for machine learning? Most definitely. Will algorithms replace domain experts? No. Is machine learning automagically better than other carefully crafted detections because, well, math? No.

Whether you’re on the vendor side (as I am) or on the user side (as many of you surely are), there is far more nuance to using ML effectively to achieve your security objectives than the hype suggests. ML can be extremely powerful, but it is not always the answer. Even when it is the best tool for the job, it is very easy to screw up an ML model through bad data, incomplete data, the wrong features, or a variety of other factors. Worse, vendors’ poorly explained metrics and unverified claims can make it impossible for users to recognize bad implementations.

Even the concepts themselves are misused and treated as interchangeable, which is why so many articles set out to ‘demystify’ AI and ML in security. To simplify, AI is the practice of making computers behave or reason in an “intelligent” manner. ML is a subset of AI in which the computer learns from data to make predictions or derive insights. In security, most discussions of AI are actually about ML, so we will stick to that terminology here for consistency.

As ML remains a hot buzzword in security, the misconceptions surrounding it only seem to be increasing. I’ll first address several of the misconceptions about ML in security, and then highlight the advantages of good ML-driven solutions when properly implemented. Given the range of pitfalls and considerations involved in optimizing an ML-driven solution, a hybrid approach that combines domain expertise with ML is necessary. It is this interplay that has made our research and development so successful, and it continues to drive innovation in the detection and protection capabilities of the Endgame platform.

Common Misconceptions

Various myths about ML are gaining a foothold in the security community. Many industry veterans are rightly skeptical and growing increasingly cynical as they watch vendors claim their magical ML solutions will solve all security problems. Machine learning is not the silver bullet often claimed. Below are four of the biggest misconceptions in security that result from that type of messaging.

  1. ML REPLACES THE NEED FOR SKILLED EMPLOYEES: This is not likely in infosec. According to some projections, the industry will face a workforce shortage of close to two million people by 2022. Demand for talent is likely to outstrip supply for a long time. Vendors need experts well versed in both ML and security to properly implement ML-driven security products. In addition, security teams need people to interpret the results and take actions via their tools, whether they are ML-driven or not. Some organizations may be able to reduce the need for additional resources, but don’t expect a significant drop. ML can augment your current workforce, but the value of your human capital isn’t going down anytime soon.
  2. ML IS INHERENTLY BETTER THAN HUMANS AT FINDING AND STOPPING INTRUSIONS: There are instances where this is true, but it is not universal. To be effective, solutions which are implemented using ML need to be trained, tuned, and tested by domain experts who understand the problem. Solutions which lack this rigor tend to be ineffective, noisy, and ultimately shelved by users.
  3. IN ML, ALGORITHMS ARE THE MOST IMPORTANT FACTORS: We hear a lot about deep learning, neural networks, and a host of other fancy-sounding terms. Most people don’t know what these words mean, but they sound great! The truth is that the quality of the data you feed to any ML technology largely determines the real-world efficacy of the resulting model. Data curation, cleaning, and labeling are hard and extremely time-consuming. They require deep domain knowledge and collaboration between data scientists and security experts.
  4. ML IN SECURITY IS ONLY ABOUT DETECTION: It is true that ML is an excellent way to solve some detection problems, but security is much bigger than your appliance blocking or finding something bad. False positives happen. An alert rarely tells the entire story of an attack. Security teams often need help deciding what to do next and how to respond. The industry’s narrow focus on ML only for detection has hindered potential advances in other key areas such as triage, response, and workflow.

Advantages of Machine Learning in Security

Despite these challenges, ML can be powerful in security when properly implemented. In fact, when used correctly, it has significant advantages over non-ML approaches in solving certain problems. These include:

  1. GENERALIZATION: Models can generalize to never-before-seen techniques, sequences of adversary actions, or malware samples based on structural or behavioral relationships within the data that are not obvious enough to be captured by hand. Human-constructed signatures or heuristics tend to be very specific and reactive, which leads to high false negative rates as adversary tools and tradecraft evolve. ML solutions, if done right, can consistently detect never-before-seen evil.
  2. SCALE AND AUTOMATION: ML solutions scale well with increasing volumes of data, a pervasive problem in the industry. They can aggregate, synthesize, and analyze disparate data sources automatically. When a novel attack occurs or new malware sample emerges, just adding it to the training set can improve the model. An army of malware analysts and signature developers is not necessary.
  3. DEEP INSIGHTS: ML learns from the data what constitutes malicious and benign content or behaviors. A human does not dictate exactly where the decision boundary between benign and malicious lies. This can lead to surprising and sometimes non-intuitive ways the underlying data can be sliced to make detection decisions – things which would not occur to a domain expert.
  4. INFREQUENT UPDATES: A need for constant signature updates can be a substantial operational burden and put systems, especially offline or off-corporate-network systems, at rapidly increasing risk of compromise. Well-constructed ML solutions generalize well to new threats and thus require far less frequent updates to be effective.

Considerations When Building ML/AI Models

Endgame has a number of powerful ML-based detection solutions in our product, with many more in development. Where we have chosen to implement ML, the advantages are significant. However, in building these capabilities, we are always extremely aware of both the pitfalls (ensuring we avoid them) and the opportunities (ensuring we optimize them) inherent in ML-based solutions. For others looking to evaluate an ML model, or thinking about building their own, I’ve compiled some of the key considerations to keep in mind when building ML-based solutions.

  1. GARBAGE IN/GARBAGE OUT: This is the most significant issue facing security researchers building models. First, gathering the right representative data, both malicious and benign, is a huge challenge. Any model will have holes in its global understanding of data and behavior, but the bigger the holes, the worse the model will perform in the real world. In addition, unsupervised machine learning – where the model alone derives inferences from the data without human labeling – generally is inadequate for most security use cases. Instead, most models are built through supervised machine learning – that is, feeding the tool a lot of training data which is labeled either good or bad. The model then learns to classify future unlabeled content or behavior, fed to it by the security application, as good or bad. Labeling of training data may sound simple, but in practice it is very hard. There are significant edge cases, such as adware, or legitimate tools like remote access solutions that do many of the same things malware does. Worse, attackers can and do sneak into the benign training set, usually not through some conscious effort on their part but through labeling laziness or mistakes. A file has zero malicious detections in VirusTotal? Scoop up that piece of advanced malware, call it benign, and your model will ignore it forever.
  2. FEATURE ENGINEERING: Models are usually built upon features which describe the data. Feature extraction is the process by which input data, such as files, are transformed deterministically into a representation of that data comprising many features. The features are engineered to encode domain knowledge about the problem to be solved. For example, Windows executable files are transformed into thousands of floating point numbers during our MalwareScore™ feature extraction process. Feature engineering, that is, researching and implementing the right feature set, is an enormously important part of the model building process. To generate features that can generalize to the future, an understanding of how and why things work is required: operating systems, networks, adversary tradecraft, etc. For example, we can simply make a feature that represents whether a binary imports functions for keylogging, but what if an adversary decides to dynamically load functions associated with keylogging to evade detection? The feature space needs to be diverse and account for evasion techniques to generalize well. Featureless learning is an exciting area of research, but it is nascent, and just as vulnerable to the garbage in/garbage out problem. In many security domains (e.g., static malware detection) hand-crafted features still represent the state of the art, allowing designers to encode decades of domain knowledge that have yet to be replicated through end-to-end deep learning.
  3. USERS WILL DO THINGS YOU DO NOT EXPECT: Your model needs an extremely low false positive rate to have a chance at being successful in production. Many models look great when tested against known data, but when deployed they explode with false positives. This is often because users don’t act in predictable ways. Administrators may create new accounts, bounce between systems aggressively, and use tools, such as PowerShell, which are often co-opted by hackers. Someone from accounting might log in at unpredictable times or try out new things she discovered online on her computer. Software your model doesn’t know about might get installed. Researchers must be aware of these issues and expect the unexpected in the real world. In addition, researchers must be aware of scale. A 1 in 10,000 false positive rate might seem great, but what if you’re observing a billion events per day? That works out to 100,000 false alerts every day.
  4. OVERFITTING: Overtraining isn’t only an issue at the gym. It is a real issue in data science. It is possible to build a model that is tuned exceptionally well to detect known data but won’t extrapolate well to unknown data. This leads to a loss of generalization and sub-par real-world efficacy. Researchers need to avoid falling into the trap of seeking perfection against a training set and instead constantly assess model performance against representative slices of withheld data; that is, data which the model did not see during training. This helps avoid performing perfectly on the known data set while failing when applied to real-world, unknown data.
  5. MISLEADING METRICS: This is mostly an industry problem, but it is also a potential issue for internal teams building and “selling” custom models as they seek funding and production deployment on their own networks. There are always tradeoffs in model performance. Researchers and engineers must select cutoffs between good and bad. It’s quite easy to manipulate the numbers and market a certain level of performance. For instance, it’s easy to detect 99.99% of all malware, but at what false positive rate? Probably way too high. This is a dangerous but unfortunately common behavior by vendors, and the impact is compounded by how hard it can be for customers and users to validate claims in this industry. Test against real, representative data, be transparent, and demand that solutions submit to third-party validation whenever possible.
  6. BURN-IN TIME: Some solutions require months to learn an environment’s baseline of normal, otherwise known as burn-in time. Beyond forcing users to wait a few months before there is any value, a lengthy burn-in time increases the opportunity for malicious data or behavior to creep into the benign training set and does not eliminate the issue of holes in understanding what is normal. Efforts should be taken to minimize burn-in time or avoid it altogether.
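
To put rough numbers on the scale and metrics points in items 3 and 5, here is a minimal Python sketch. The scores, labels, and function names are made up for illustration and do not come from MalwareScore™ or any other real model. The first function does the alert-volume arithmetic for a "1 in N" false positive rate. The second shows how the detection rate and the false positive rate move together as the cutoff between good and bad changes.

```python
# Illustrative only: the scores and labels below are invented, not real telemetry.

def expected_false_positives(events_per_day, one_in_n):
    """Expected benign events flagged per day at a '1 in N' false positive rate."""
    return events_per_day / one_in_n


def rates_at_threshold(scores, labels, threshold):
    """Return (detection_rate, false_positive_rate) for a given cutoff.

    Anything scoring at or above the threshold is called malicious.
    Labels use 1 for malicious and 0 for benign.
    """
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    malicious = sum(labels)
    benign = len(labels) - malicious
    return true_pos / malicious, false_pos / benign


# Hypothetical model scores (higher means more suspicious) and ground truth.
SCORES = [0.95, 0.90, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10, 0.05]
LABELS = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# A 1 in 10,000 false positive rate applied to a billion events per day.
print(expected_false_positives(1_000_000_000, 10_000))

# Lowering the cutoff raises detection, but false positives rise with it.
for threshold in (0.9, 0.5, 0.25):
    detection, fpr = rates_at_threshold(SCORES, LABELS, threshold)
    print(f"cutoff={threshold}: detection={detection:.0%}, false positives={fpr:.0%}")
```

On this toy data, the strictest cutoff flags nothing benign but catches only 40% of the malicious samples, while the loosest cutoff catches all of them and flags 40% of the benign ones. A headline detection rate means little unless the false positive rate at the same cutoff is reported alongside it.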

Moving Beyond Detection

In addition to these considerations, we must also hold up the actual use case to tight scrutiny. To date, ML is largely only applied to detection. For those who spend their time protecting systems day in, day out, you may by this point be asking yourself, “do I even need more alerts?” “Is detection even my biggest problem?” I find that the answer to both questions is often no. Security products need to be much broader than pure detection. They need to provide context, tie information together, and guide the practitioner through incident triage, scoping, and response. Today’s products generally are severely lacking in these areas. It is time to consider how AI and ML can improve the workflow including, but beyond, detection.

ML and AI can help automatically gather and correlate data, suggest or even automatically execute response actions, remember what you did before, and make the process of asking questions of a mountain of available data straightforward and accessible to users. We need to collectively think bigger about AI in security and demand tools which apply AI and ML not just to create more alerts about things that might be malicious, but to make the process of doing our jobs as security analysts easier.

At Endgame, through our own experiences as operators and analysts, we know the key pain points and hurdles, and bake ML into our endpoint security product in effective ways. Our research and development team combines experienced data scientists, reverse engineers, incident responders, and threat experts hailing from diverse industry and government backgrounds. Because domain expertise is so critical, security domain experts are paired with data scientists to craft, evaluate, and red-team solutions. We constantly question our assumptions, endlessly seek and fill gaps in our data sets and features, and get way into the weeds in analyzing our results to identify and fix potential shortcomings. The fruits of this tireless labor include our ML-powered MalwareScore™ feature, which we’d put up against anyone in the industry (and you can check it out yourself in VirusTotal).

We are also very comfortable in saying that ML is not the answer to all our detection woes. Our researchers and engineers have built world-class capabilities our customers use every day to block exploits, stop process injection, find injected code, eliminate ransomware, and much more. The techniques we use are very powerful and no less interesting or revolutionary because of the absence of ML – our growing patent portfolio and publicly shared research in these areas attest to this.

We also treat the usability and workflow problem as a serious R&D challenge in need of a solution. By pairing researchers with user experience experts, we address those key workflow pain points in ways that are novel yet intuitive to use. The result is Artemis®, our AI-powered chatbot, which eases search, triage, and response throughout our endpoint protection platform. This does not eliminate the need for highly skilled users, but it does facilitate training of new users and removes friction for more advanced users. Under the hood remains a powerful two-way API which is fully documented and open for any user, extending our commitment to usability and extensibility.

The Bottom Line

Looking beyond the hype and buzzwords, when implemented well ML can be a powerful piece of a security program or tool. The key challenge is to build it correctly, taking into account the range of factors I’ve addressed in this post. The marketplace is littered with poorly implemented ML-based solutions fronted by aggressive marketing. To help cut through the noise, the graphic below provides a cheat sheet for key questions to ask when evaluating ML-driven solutions.


The key to effective implementation of ML is domain expertise. In security, that domain expertise needs to drive dataset curation, labeling, feature selection, and evaluation. It can be equally effective to apply that domain expertise directly via other methods of detection. You need real-time, inline analysis of data and actions on monitored systems, looking for behaviors across the range of adversary techniques. ML isn’t the answer when domain expertise is required to go deep into the kernel, boil adversary actions like process injection down to their essence, and block them in real time.

Most importantly, remember that machine learning is a powerful tool, but is not inherently good or even better than alternatives. The appropriate use case, data, parameters, and domain expertise (just to name a few of the challenges) all impact the efficacy of ML-based solutions. Moreover, beyond improving detection, as a community we must think bigger and address our industry’s usability problems and how ML could help alleviate some of the major workflow challenges. To this end, at Endgame we will continue to research and operationalize powerful new features, both ML and non-ML based. Stay tuned!