Editor’s Note: Elastic joined forces with Endgame in October 2019, and has migrated some of the Endgame blog content to elastic.co. See Elastic Security to learn more about our integrated security solutions.
Machine learning is a fashionable buzzword right now in infosec, and is often referenced as the key to next-gen, signature-less security. But along with all of the hype and buzz, there is also a mind-blowing amount of misunderstanding surrounding machine learning in infosec. Machine learning isn't a silver bullet for all information security problems, and in fact can be detrimental if misinterpreted. For example, company X claims to block 99% of all malware, or company Y's intrusion detection will stop 99% of all attacks, yet customers see an overwhelming number of false positives. Where's the disconnect? What do the accuracy numbers really mean? These simple statistics lose meaning without the proper context. However, many in the security community lack the fundamentals of machine learning, limiting their ability to separate the relevant and credible insights from the noise.
To help bridge this gap, we're writing a series of machine learning-related blog posts to cut through the fluff and simplify the relevant fundamentals of machine learning for operationalization. In this first post, we provide a basic description of machine learning models. Infosec is ripe for a multitude of machine learning applications, but we’ll focus our overview on classifying malware, using this application to demonstrate how to compare models, train and test data, and how to interpret results. In subsequent blog posts, we'll compare the most prominent machine learning models for malware classification, and highlight a model framework that we believe works well for the malware-hunting paradigm on a lightweight endpoint sensor. While this post is focused on malware classification, the machine learning fundamentals presented are applicable across all domains of machine learning.
What is Machine Learning?
In general, machine learning is about training a computer to make decisions. The computer learns to make decisions by observing patterns or structures in a dataset. In machine learning parlance, the output of training is called a model, which is an imperfect generalization of the dataset, and is used to make predictions on new data. Machine learning offers clear advantages, automating data munging and analysis at scale. For example, executables are either benign or malicious, but it's impossible to manually review all of them. A corporate system may contain millions of files that require classification and few, if any, companies have enough staff to manually inspect each file. Machine learning is perfect for this challenge. A machine learning model can classify millions of files in minutes and can generalize better than manually created rules and signatures.
Supervised learning models are trained with examples to answer precisely these kinds of questions, such as, “is this file malicious?”. In this supervised learning setting, the training set may consist of two million Windows executables consisting of one million malware samples and one million benign samples. A machine will observe these samples and learn how to differentiate between benign and malicious files in order to answer the question. Typically, this decision is in the form of a score such as a single value between 0 and 1. The figure below demonstrates the creation and use of a supervised model.
The model's scores are converted to yes/no answers by way of a threshold. For example, if our scores ranged from 0 to 1, we may want to set a threshold of 0.5. Anything less than 0.5 is benign ("not malicious"), and everything equal to or greater than 0.5 is malware ("is malicious"). However, models are rarely perfect (discussed later) and we may want to tweak this threshold for more acceptable performance. For instance, perhaps missing malware is far worse than labeling a benign sample as malware. We might set the threshold lower, say 0.3. Or, maybe mislabeling in general is very bad, but our use case will allow for outputting unknown labels. In this case we can set two thresholds. We could choose to set everything below 0.3 as benign, everything above 0.7 as malicious, and everything else is unknown. A visualization of this is below. The important point here is that there is no one-size-fits-all solution, and it is essential to understand your data well and adjust models based on the use case and data constraints.
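As a minimal sketch of the two-threshold scheme described above (the function name and default thresholds are illustrative, not from any particular product):

```python
def label_sample(score, benign_max=0.3, malicious_min=0.7):
    """Map a model score in [0, 1] to a label using two thresholds.

    Scores below `benign_max` are labeled benign, scores at or above
    `malicious_min` are labeled malicious, and everything in between
    is left as unknown for further triage.
    """
    if score < benign_max:
        return "benign"
    if score >= malicious_min:
        return "malicious"
    return "unknown"

print(label_sample(0.12))  # benign
print(label_sample(0.55))  # unknown
print(label_sample(0.91))  # malicious
```

Collapsing the two thresholds into one (e.g., both at 0.5) recovers the simple yes/no scheme; widening the gap between them trades coverage for confidence.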
How do we evaluate models?
Metrics are necessary to compare models to determine which model might be most appropriate for a given application. The most obvious metric is accuracy, or the percentage of the decisions that your model gets right after you select appropriate thresholds on the model score. However, accuracy can be very misleading. For example, if 99.9% of your data is benign, then just blindly labeling everything as benign (no work required!) will achieve 99.9% accuracy. But obviously, that's not a useful model!
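To see why accuracy misleads on imbalanced data, here is a toy illustration mirroring the 99.9%-benign example above:

```python
# A dataset where 999 of 1,000 samples are benign, and a "model"
# that blindly labels everything benign (no work required!).
labels = ["benign"] * 999 + ["malicious"] * 1
predictions = ["benign"] * len(labels)  # the do-nothing model

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"accuracy = {accuracy:.1%}")  # 99.9%, yet no malware is ever caught
```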
Results should be reported in terms of false positive rate (FPR), true positive rate (TPR), and false negative rate (FNR). FPR measures the rate at which we label benign samples as malicious. An FPR of 1/10 means that we classify 1 in 10 (or 10%) of all benign samples incorrectly as malicious. This number should be as close to 0 as possible. TPR measures the rate at which we label malicious samples as malicious. A TPR of 8/10 means that we classify 8 in 10 (or 80%) of all malicious samples as malicious. We want this number to be as close to 1 as possible. FNR measures the rate at which we label malicious samples as benign and is the complement of TPR (equal to 1-TPR). A 9/10 (9 in 10) TPR is the same as a 1/10 (1 in 10) FNR. Like FPR, we want this number to be as close to 0 as possible.
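These definitions can be sketched directly from the four confusion-matrix counts (a simple illustration with made-up labels; 1 means malicious):

```python
def rates(labels, predictions):
    """Compute (FPR, TPR, FNR) from binary labels, where 1 = malicious."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fpr = fp / (fp + tn)        # benign samples flagged as malicious
    tpr = tp / (tp + fn)        # malicious samples caught
    fnr = fn / (tp + fn)        # malicious samples missed (equals 1 - TPR)
    return fpr, tpr, fnr

# 10 benign, 10 malicious; the model flags 1 benign file
# and catches 8 of the 10 malicious files.
labels      = [0] * 10 + [1] * 10
predictions = [1] + [0] * 9 + [1] * 8 + [0] * 2
fpr, tpr, fnr = rates(labels, predictions)
print(fpr, tpr, fnr)  # FPR = 0.1, TPR = 0.8, FNR = 0.2
```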
Using these three measurements, our results don't look so great for the useless model above (always labeling a sample benign). Our FPR is 0% (perfect!), but our TPR is also 0%. Since we never label anything as positive, we will never produce a false positive. But this also means we will never get a true positive, and while our accuracy may look great, our model is actually performing quite poorly.
This tradeoff between FPR and TPR can be seen explicitly in a model's receiver operating characteristic (ROC) curve. A ROC curve for a given model shows the FPR / TPR tradeoff over all thresholds on the model’s score.
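A rough sketch of how ROC points arise, sweeping a threshold over the model's scores (an illustration only, not an optimized implementation like scikit-learn's `roc_curve`):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs for every candidate threshold.

    `scores` are model outputs in [0, 1]; `labels` are 1 for malicious.
    Each distinct score acts as a threshold: predict malicious when
    score >= threshold. The final threshold above every score yields
    the (0, 0) corner where nothing is flagged.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)) + [1.1]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(roc_points(scores, labels))
```

Plotting these pairs traces the ROC curve; each point is the FPR/TPR tradeoff at one threshold, which is exactly the choice discussed above.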
What FPR, TPR and FNR metrics really mean to a user greatly depends on scale. Let's say that we have an FPR of 0.001 (1/1000). Sounds great, right? Well, a Windows 7 x64 box has approximately 50,000 executables on a given system. Let's say you have 40,000 endpoints across your network. Absent any actions to rectify the false positives, this model would produce approximately two million false positives if applied to all Windows executables on all systems across your entire network! If your workflow calls for even the most basic triage on alerts, which it should, this would be an inordinate amount of work.
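The back-of-envelope arithmetic behind that two-million figure:

```python
fpr = 0.001             # 1 false positive per 1,000 benign files
files_per_host = 50_000  # approximate executables on a Windows 7 x64 box
hosts = 40_000           # endpoints across the network

expected_fps = fpr * files_per_host * hosts
print(f"~{expected_fps:,.0f} false positives")  # ~2,000,000
```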
Importance of Training and Testing Data
So is the key to good machine learning a good metric? Nope! We also have to provide our learning stage with good training and testing data. Imagine all of our training malware samples update the registry, but none of the benign samples do. In this example, a model could generalize anything that touches the registry as malware, and anything that does not touch the registry as benign. This, of course, is wrong and our model will fail miserably in the real world. This example highlights a phenomenon known as overtraining or overfitting, which occurs when our model becomes too specific to our training set and does not generalize. It also highlights a problem of bias, where our training/testing data over-represents a sub-population of the real-world data. For example, if the training data is disproportionately populated with samples from a small number of malware families, you may end up with a model that does a great job detecting new variants of those specific families but a lousy job detecting anything else.
Why is this important? Overfitting and bias can lead to misleading performance. Let's assume you are training your malware classification model on 5 families (benign + 4 different malware families). Let's also assume that you computed FPR, TPR, and FNR on a different set of samples consisting of the same 5 families. We'll also assume that your model is awesome and gets near perfect performance. Sounds great, right? Not so fast! What do you think will happen when you run this model in the real world and it’s forced to classify families of malware that it has never seen? If your answer is "fail miserably" then you're probably correct.
Good performance metrics are not enough. The training/testing data must represent and function in the real world if you expect similar runtime performance as the performance measured during test time.
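One hedge against the family bias described above is to hold out entire malware families at evaluation time, so the test measures generalization to malware the model has never seen rather than to new variants of familiar families. A minimal sketch (the function name and sample layout are assumptions for illustration):

```python
import random

def split_by_family(samples, test_fraction=0.3, seed=0):
    """Split samples so no malware family spans train and test.

    `samples` is a list of (features, label, family) tuples. Whole
    families are randomly assigned to the test set, which prevents a
    model from being graded on variants of families it trained on.
    """
    families = sorted({fam for _, _, fam in samples})
    rng = random.Random(seed)
    rng.shuffle(families)
    n_test = max(1, int(len(families) * test_fraction))
    test_families = set(families[:n_test])
    train = [s for s in samples if s[2] not in test_families]
    test = [s for s in samples if s[2] in test_families]
    return train, test

# Toy usage: 20 samples spread across 4 hypothetical families.
samples = [((i,), 1, f"family_{i % 4}") for i in range(20)]
train, test = split_by_family(samples)
print(len(train), len(test))
```

A model that scores well under a random split but poorly under a family-holdout split is likely overfitting to the families in its training data.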
So where do you get good train and test data? At Endgame we leverage a variety of sources ranging from public aggregators and corporate partnerships to our own globally distributed network of sensors. Not only do these sources provide a vast amount of malicious samples, but they also aid in acquiring benign samples, helping Endgame achieve a more diverse set of data. All of this allows us to gather real world data to best train our classifiers. However, your training data can never be perfect. In the case of malware, samples are always changing (such as new obfuscation techniques) and new families of malware will be deployed. To counteract this ever-evolving field, we must be diligent in collecting new data and retraining our models as often as possible. Without a constantly evolving model, we will have blind spots and adversaries are sure to exploit them. Yes, the game of cat and mouse still exists when we use machine learning!
Now that your training/testing data is in the computer, we need to decide which samples are benign and which samples are malicious. Unlabeled or poorly labeled data will certainly lead to unsuccessful models in the real world. Malware classification raises some interesting labeling problems ranging from the adversary use of benign tools for nefarious means to new, ever-evolving malware that may go undetected.
- Unlabeled data. Data acquired from a sensor is often unlabeled, and presents a serious problem for machine learning. Unlabeled binaries require human expertise to reverse engineer and manually label the data, which is an expensive operation that does not scale well to the size of the training set needed for model creation.
- Incorrectly labeled data. Unfortunately, when classifying malware, the malware is often incorrectly labeled. Whether labeled by humans or machines, the necessary signatures may not be accessible or known, and therefore by default the malware is classified as benign. This will ultimately confuse the classifier and degrade performance in terms of FPR, TPR, and FNR.
- Nebulous or inconsistent definitions. There is no consistent or concrete definition of what constitutes malware. For example, it is common for attackers to use administrative tools to navigate a system post-attack. These tools aren't inherently malicious, but they often have the same capabilities as their malicious counterparts (e.g., the ability to send network commands, view processes, dump passwords) and are often used in attacks. On the flip side, they are readily available and often used on a system administrator's machine. Constantly calling them malicious may annoy those administrators enough to entirely ignore your model's output. Decisions have to be made on what labels, and how much weight, to give these sorts of programs during training to ensure models generalize instead of overfit, and support business operations instead of becoming a nuisance.
The table below summarizes some primary challenges of labeling and their solutions.
So remember, machine learning is awesome, but it is not a silver bullet. Signatures and IOCs are not going away, yet some of the most relevant, impactful advances in infosec will stem from machine learning. Machine learning will help leapfrog our detection and prevention capabilities so they can better compete with modern attacker techniques, but it requires a lot of effort to build something useful, and, like all approaches, will never be perfect. Each iterative improvement to a machine learning model will be met with a novel technique by an adversary, which is why models must evolve and adapt just as adversaries do.
Evaluating the performance of machine learning models is complicated, and results are often not as really, really, ridiculously good looking as the latest marketecture would have you believe. When it comes to security, you should always be a skeptic, and the same goes for machine learning. If you are not a domain expert, just keep the following questions in mind the next time you are navigating the jungle of separating machine learning fact from fiction.
We believe that machine learning has the potential to revolutionize the field of security. However, it does have its limitations. Knowing these limitations will allow you to deploy the best solutions, which will require both machine learning and conventional techniques such as signatures, rules, and technique-based detections. Now that you are armed with a high-level introduction to interpreting machine learning models, the next step is choosing the appropriate model. There are a myriad of tradeoffs to consider when selecting a model for a particular task, and this will be the focus of our next post. Until then, keep this checklist handy and you too will be able to separate data science fact from data science fiction.