Putting the MITRE ATT&CK evaluation into context

Editor’s Note: Elastic joined forces with Endgame in October 2019, and has migrated some of the Endgame blog content. See Elastic Security to learn more about our integrated security solutions.

Today, MITRE published the results of their first public EDR product evaluation. This effort was a collaboration between MITRE and seven EDR vendors to understand how various products can be used to provide security teams with visibility into post-compromise adversary techniques. In the test, MITRE executed a set of techniques using open source methods mirroring previously-observed APT3 techniques. In their write-up, they’ve supplied information about how vendors provided alerting and/or visibility into data associated with their execution of a technique.

This is an extremely valuable contribution to the infosec community. Frank, Katie, Blake, Chris and others at MITRE should be applauded for all the hours and energy they poured into generating this groundbreaking body of knowledge. The testing was well organized, the data capture thorough, and the finalization of results fair and collaborative. That last point is especially noteworthy given the huge amount of nuance and inherent lack of any one universal “right way” to address much of ATT&CK. This evaluation is a great achievement from MITRE, and we look forward to working with MITRE on continually refining the process and participating in future tests.

As we reflect on the test and what it means, we would like to add some perspective to put the results into context.

Why the MITRE ATT&CK evaluation is valuable and important

Product testing is not new. Endgame is a participant in public testing and an active member of the Anti-Malware Testing Standards Organization (AMTSO). Transparency and openness are foundational Endgame operating principles. Not being afraid of competitive testing and evaluation is a necessary part of that, despite every independent test having different imperfections. We welcome it.

What is new about this test is that it focuses entirely on post-compromise visibility. Depending on how you look at it, that has been either intentionally ignored by public evaluations until now, or a complete or near-complete blind spot for them. This matters. Why?

The community has become increasingly aware that it’s not all about exploit and malware blocking. Adversaries can perform operations using nothing but credentials and native binaries. Whether from a vendor or a result of home-grown detection engineering, none of our detections or protections are immune to bypass, no matter anyone’s claims. Organizations need to assume they’re breached and build security programs which allow for the discovery of active attackers in the environment.

MITRE ATT&CK is by far the best, most authoritative knowledge base of techniques to consider in building a detection program which includes the “assume breach” concept. All organizations require tooling to give them data and detection capabilities, whether they build their own or, as most do, work with one or more vendors to provide data gathering, querying capabilities, alerting, and other components.

The ATT&CK product evaluation provides a good reference dataset highlighting various methods of detection. It starts to move towards a taxonomy describing types of detection and visibility - the taxonomy MITRE has given us is complex and perhaps imperfect, but that’s reflective of the problem as a whole. It’s not a simple yes/no answer or a numeric score, like typical tests which measure whether a piece of malware was blocked or not. Most importantly, the evaluation moves us forward in emphasizing the fundamental importance of data visibility when it comes to building a program and considering tooling.

The MITRE evaluation isn’t everything

The evaluation provides a massive amount of data and people will naturally wonder how to action that information. As we’ve described before (and we’re not the only ones), ATT&CK is not a measuring stick. It’s a knowledge base. Trying to use it as a universal, quantitative measurement device is a recipe for failure.

We could probably spend entire posts delving into each of these items, and this list isn’t comprehensive, but some of the pitfalls and challenges inherent to trying to quantify ATT&CK include:

  • Not considering real-world scenarios. In the real world, you don’t need to detect or block every component of an attack to disrupt an adversary or remediate an action. We build layered behavioral preventions and detections for our customers. These layers, working together, leave a vanishingly small probability of missing a real attack, even if we know it’s likely we won’t alert on every action taken in an attack. We know individual protections will sometimes miss or be bypassed. Similarly, incident responders will tell you that it’s a pipe dream if you ever imagine you will have a completely airtight picture of every technique used by an adversary in a known breach. 100% visibility is not necessary for effective remediation.
  • Lack of prioritization or weighting of techniques. Is deep, signatureless coverage of process injection more important than knowing that an attacker base64 encoded something on an already compromised box? For any enterprise team I can conceive of, yes, injection coverage is dramatically more important. There’s no notion of prioritization between techniques in ATT&CK. See this post we did last year for a deeper dive into ways technique coverage could be prioritized by teams according to their particular threat landscape and interests. MITRE hasn’t included prioritizations for a reason: it is not a weighted measurement tool, it’s a knowledge base. Turning it into a score sheet can be counterproductive.
  • ATT&CK is incomplete. MITRE does a great job updating ATT&CK as new techniques become known. This regularly happens due to white hat security research, adversary evolution, and new threat reporting. ATT&CK is by definition always behind the cutting edge in the real world, and it has gaps. The level of specificity in a given technique also varies widely. We are excited about future decomposition of techniques into sub-techniques, as there are usually a number of known methods to invoke a single technique. In this particular evaluation, you’ll note some cases where MITRE chose a few different ways to implement a single technique. This is good and reflective of reality. But there is a huge number of untested alternative implementations, even for the techniques used in this evaluation. Testing everything would be nearly impossible.
  • Noise in production. Is an alert better than telemetry? Sometimes yes, sometimes no. The majority of the activity described in ATT&CK is seen in most enterprises on a daily basis. We cannot seek alerting coverage across all of ATT&CK. It would overwhelm security teams with noise and FPs. Taking that idea further, we shouldn’t even overextend in an attempt to provide visibility to every cell - there are diminishing returns in the real-world in doing so.
  • Data robustness. Not all data is created equal in terms of enrichments and hardening against adversaries determined to get around your EDR solution. There’s a growing body of research around this topic, for example this excellent talk by William Burgess called “Red Teaming in the EDR Age.” We highly recommend it and similar work to anyone considering visibility. Many common sources of EDR data can be undermined by an attacker with access. At Endgame, we put a lot of effort into hardening our datasources. Not all EDR vendors do the same. This is an important factor but one which would not be easy to measure in an evaluation.
  • Evaluating the tool or the team? For a nuanced evaluation such as this, some amount of expertise and knowledge is required. In the MITRE evaluation, vendors were invited to deploy, configure, and participate in the evaluation on the blue team side. This makes tremendous sense, as MITRE had enough work to do beyond overcoming the often steep learning curve of the various EDR products. Endgame takes great pride in how readily our customers can consume and make use of advanced capabilities, compared with the deep expertise required by other tools in this space. Assessing usability and accounting for a security team’s expertise would be very hard in an evaluation.
  • Not a full product assessment. Visibility is one important component of any endpoint security tool. Other important components include prevention, hardening (discussed above), response, usability, and a host of considerations around topics like deployment, endpoint impact, network impact, and more.
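To make the prioritization point above concrete, here is a minimal, purely hypothetical sketch of how a team might weight its ATT&CK coverage gaps against its own threat landscape. The technique names, coverage estimates, and priority weights are all illustrative assumptions, not figures from MITRE, Endgame, or the evaluation:

```python
# Hypothetical sketch: ranking ATT&CK technique coverage gaps by an
# organization-specific priority. All names and numbers are illustrative.

coverage = {
    "T1055 Process Injection": 0.9,   # estimated detection coverage, 0-1
    "T1140 Deobfuscate/Decode Files or Information": 0.4,
    "T1003 Credential Dumping": 0.8,
}

# Priority reflects one team's threat landscape, not a universal ranking.
priority = {
    "T1055 Process Injection": 5,
    "T1140 Deobfuscate/Decode Files or Information": 1,
    "T1003 Credential Dumping": 4,
}

def weighted_gaps(coverage, priority):
    """Rank techniques by priority-weighted coverage gap, largest first."""
    gaps = {t: priority[t] * (1.0 - c) for t, c in coverage.items()}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

for technique, gap in weighted_gaps(coverage, priority):
    print(f"{technique}: weighted gap {gap:.2f}")
```

Even a toy model like this shows why a flat count of covered cells misleads: a small gap in a high-priority technique can matter more than a large gap in a low-priority one, and the weights are inherently specific to each organization.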

None of this is intended as a criticism of MITRE’s evaluation. In fact, they’ve taken care not to overstate what the test is by providing information about evaluated products that is narrowly scoped around post-compromise visibility. They haven’t attempted to score or rank vendor products, and neither should we.

Even teams new to ATT&CK should be working to incorporate it into their security program. There is a lot to consider, but there are ways to get started by taking small bites out of the huge ATT&CK sandwich. We’ve recently written about this topic, with some of that information available here.

What about Endgame’s evaluation?

We are pleased with how the evaluation describes our capabilities. Our agent provides visibility into the vast majority of techniques tested by MITRE in the evaluation, using a good balance of alerting behavioral detections and straightforward visibility into activity via our telemetry. Some of the noteworthy items in the results include:

  • ATT&CK Integration. The results showcase our product’s long-standing ATT&CK integration, in which behavioral detections are linked to ATT&CK techniques.
  • Access to Telemetry. MITRE’s results detail our interactive process tree, Endgame Resolver™. Telemetry is easily visible from this tree. It’s not readily apparent from static screenshots, but the entire tree is interactive and response actions can be taken right from the tree.
  • Enrichments. Custom enrichments are shown for ATT&CK-relevant items that didn’t make sense for alerting. For example, execution of ipconfig doesn’t create alerts on its own, but if it is related to processes with higher confidence alerting, the potential security relevancy of that ipconfig execution is highlighted for the user.
  • Memory Introspection. In-memory artifact capture is also showcased in the evaluation, with artifacts such as strings present in injected threads automatically captured for inspection.
  • Everyone Has Gaps and Differences. Some visibility gaps exist, and for most of those, we already have robust solutions in flight. For example, our customers will be excited to see enhanced network data capture in our next monthly release. None of the gaps identified in this ATT&CK evaluation are news to us, and as reflected in the Notes, we disagree in some cases about whether they are actually gaps rather than differences in evaluator expectations and workflow. That said, we look forward to continual assessment and relentless improvement.
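The enrichment idea described above (an otherwise-benign `ipconfig` execution being flagged when it is related to higher-confidence alerting) can be sketched roughly as follows. This is an assumed, simplified model for illustration only; the event shapes and field names are hypothetical and not Endgame’s actual schema:

```python
# Hypothetical sketch of alert-based enrichment: benign-looking telemetry
# (e.g. ipconfig.exe) is marked security-relevant when any ancestor in its
# process tree carries a high-confidence alert. Fields are illustrative.

events = [
    {"pid": 100, "ppid": 1,   "name": "explorer.exe",  "alert": False},
    {"pid": 200, "ppid": 100, "name": "malicious.exe", "alert": True},
    {"pid": 300, "ppid": 200, "name": "ipconfig.exe",  "alert": False},
]

def enrich(events):
    """Mark events whose ancestor chain contains an alerted process."""
    by_pid = {e["pid"]: e for e in events}
    for e in events:
        relevant, pid = e["alert"], e["ppid"]
        # Walk up the process tree until an alert is found or the root is hit.
        while pid in by_pid and not relevant:
            relevant = by_pid[pid]["alert"]
            pid = by_pid[pid]["ppid"]
        e["security_relevant"] = relevant
    return events

for e in enrich(events):
    print(e["name"], e["security_relevant"])
```

In this toy run, `ipconfig.exe` generates no alert of its own but is surfaced as security-relevant because its parent carries one, which mirrors the behavior the bullet describes.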

What’s next?

We’re proud to have participated in this evaluation and hope to participate again, should MITRE continue to lead evaluations. Through our membership in AMTSO, we look forward to continued collaboration with MITRE on ways to design and run both this evaluation and other competitive testing. We’ll continue to contribute to the community’s overall understanding of how to build a security program, including how to operationalize ATT&CK. And, of course, we’ll keep building and enhancing the Endgame platform for our current and future customers.