Editor’s Note: Elastic joined forces with Endgame in October 2019, and has migrated some of the Endgame blog content to elastic.co. See Elastic Security to learn more about our integrated security solutions.
Last year, we introduced Endgame MalwareScore®, a machine learning malware detection and protection engine for Windows Portable Executable (PE) files. Since its release, MalwareScore has proven capable of detecting emergent malware and resilient against bypass. For example, recent research from Recorded Future stated that MalwareScore was the only classifier in VirusTotal that detected malware signed with particular counterfeit certificates. Beginning today, MalwareScore supports macOS. We are extremely confident in our new Mac support, which has been released in VirusTotal for the world to see!
Mac support is a major enhancement that required overcoming many challenges we didn’t encounter when we created MalwareScore for Windows. We faced three primary challenges during development. First, there is a lack of good open source tools for parsing Mach-O files (the binaries that run on macOS). Second, our internal data pipelines and infrastructure were built for the specific use case of processing PE files (Windows binaries). Third, and most difficult, we had limited training data. This post walks through each of these challenges and how we overcame them.
Lack of Open Source Mach-O Parsers
For Windows PE files, there are tools that do a tremendous job of parsing them. We instantly gravitated toward Ero Carrera's pefile parser for our initial R&D efforts, allowing Endgame data scientists to jump right into in-depth feature engineering. Once we were ready to productionize our Windows classifier, we implemented our own highly optimized PE parser to ship with our product. Unfortunately, there is no parallel in the world of Mac Mach-O files. There are some nice command line utilities, such as Jonathan Levin’s jtool, for parsing Mach-O files, but they weren’t built for our use case and would not scale to the millions of files we needed to process. Other tools, like Quarkslab’s LIEF, have a lot of potential but were not mature enough when we first started our research into a Mach-O classifier. Ultimately, we had to roll our own static parser to best support our research.
Fortunately, there are a lot of good references for learning about the Mach-O file format. We recommend reading the Mach-O file format reference and Jonathan Levin’s book macOS and iOS Internals, Volume III: Security & Insecurity. We used these heavily in implementing our static parsers and as a reference during feature engineering.
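To give a sense of what the first step of such a parser looks like, here is a minimal, hypothetical sketch (not the parser we shipped) that reads the fixed-size mach_header_64 structure from the front of a 64-bit Intel Mach-O file, using only Python's struct module. The field layout follows the Mach-O file format reference.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF  # 64-bit Mach-O magic, little-endian on Intel

def parse_macho_header(data: bytes) -> dict:
    """Parse the 32-byte mach_header_64 at the start of a Mach-O file.

    Illustrative only: a production parser must also walk the load
    commands, validate sizes, and handle 32-bit and byte-swapped files.
    """
    if len(data) < 32:
        raise ValueError("too short to be a 64-bit Mach-O file")
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic != MH_MAGIC_64:
        raise ValueError(f"unexpected magic: {magic:#010x}")
    fields = struct.unpack_from("<7I", data, 4)
    keys = ("cputype", "cpusubtype", "filetype",
            "ncmds", "sizeofcmds", "flags", "reserved")
    return dict(zip(keys, fields))
```

Fields such as filetype (executable vs. dylib vs. bundle) and ncmds already make useful raw inputs for feature engineering before any load commands are parsed.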
Retooling Our Data Pipeline
There are significantly fewer unique Mach-O files in the wild than PE files, so scale in our pipeline was of little concern. However, there were small details that forced significant changes in how we process, store, and format data. The first detail that required refactoring was the paradigm of Mach-O Universal binaries (or Fat files). These files are essentially a small header followed by one or more standard Mach-O files concatenated together. The issue is that it’s possible to have both malicious and benign Mach-O files contained in the same Fat file! Additionally, Endgame only supports Intel architectures (macOS), so other architectures packaged in the same Fat file, such as ARM-based iOS binaries and binaries for older PowerPC Macs, are of little to no use in our training data. Sifting through all this created complexity we didn’t anticipate going into the effort.
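The slice-filtering step described above can be sketched as follows. This is a simplified illustration (it ignores the 64-bit FAT_MAGIC_64 variant and does no bounds checking); the fat header and its fat_arch entries are always big-endian, and the CPU type constants come from mach-o/machine.h.

```python
import struct

FAT_MAGIC = 0xCAFEBABE          # fat header is always big-endian
CPU_TYPE_I386 = 0x00000007
CPU_TYPE_X86_64 = 0x01000007    # CPU_ARCH_ABI64 | CPU_TYPE_X86

def intel_slices(data: bytes):
    """Yield (cputype, slice_bytes) for each Intel slice in a Fat file.

    Non-Intel slices (ARM, PowerPC, ...) are skipped, mirroring the
    filtering described in the post.
    """
    magic, nfat_arch = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a Fat (Universal) binary")
    for i in range(nfat_arch):
        # Each fat_arch entry is five big-endian uint32s (20 bytes).
        cputype, _cpusub, offset, size, _align = struct.unpack_from(
            ">5I", data, 8 + i * 20)
        if cputype in (CPU_TYPE_I386, CPU_TYPE_X86_64):
            yield cputype, data[offset:offset + size]
```

Each yielded slice is then an ordinary Mach-O file that can be parsed and labeled on its own, which is what makes the "one Fat file, mixed labels" problem tractable.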
In addition, much of our pipeline is backed by SQL, which makes for fast and expressive querying but does not gracefully handle fundamental changes in data formatting. Instead of bolting on a change, we fundamentally restructured our tables to better support new file types. That extensibility gives us the option to add new types in the future but, of course, required a huge data migration. As anyone who has done a data migration will tell you, it is a long and painful process, especially when you have deadlines to meet!
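As a purely hypothetical illustration of the kind of restructuring involved (the post does not describe Endgame's actual schema), one file-type-agnostic layout keeps format-specific attributes out of the core table so that a new binary format is a new row value rather than a schema change:

```python
import sqlite3

# Illustrative schema only: a generic samples table keyed by hash,
# with a file_type discriminator and per-sample attributes in a
# child table instead of hard-coded PE-specific columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (
        sha256    TEXT PRIMARY KEY,
        file_type TEXT NOT NULL,   -- 'pe', 'macho', 'fat', ...
        size      INTEGER
    );
    CREATE TABLE features (
        sha256 TEXT REFERENCES samples(sha256),
        name   TEXT,
        value  REAL
    );
""")
conn.execute("INSERT INTO samples VALUES (?, ?, ?)",
             ("abc123", "macho", 4096))
```

Supporting, say, ELF later then means inserting rows with file_type = 'elf' rather than migrating every table again.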
Another seemingly small but important detail is that of magic bytes. Magic bytes occur at the beginning of a file so that its type can be inferred from something other than the file extension. For example, Fat Mach-O files begin with 0xcafebabe. Unfortunately, Java class files also start with the same four bytes. This means deeper parsing needs to occur when data is being mined. A naïve approach of just pulling in files with those magic bytes will leave you with a large amount of useless data, and you’ll wonder why the number of samples with successfully extracted features is much smaller than expected (I speak from experience!).
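One common heuristic for telling the two formats apart (a rule of thumb, not the check we necessarily shipped) looks at the next four bytes after the magic: in a Fat binary they hold nfat_arch, a small architecture count, while in a Java class file they hold the minor and major version numbers, which decode to a big-endian value of at least 45.

```python
import struct

def classify_cafebabe(data: bytes) -> str:
    """Heuristically distinguish a Fat Mach-O from a Java class file.

    Both formats start with 0xCAFEBABE. The second big-endian uint32
    is nfat_arch in a Fat binary (realistically a handful at most),
    but minor/major version in a Java class file (major >= 45), so a
    small threshold separates them in practice.
    """
    magic, second = struct.unpack_from(">II", data, 0)
    if magic != 0xCAFEBABE:
        return "other"
    return "fat-macho" if second < 20 else "java-class"
```

Cheap checks like this, applied at mining time, keep the Java class files out of the Mach-O feature extractor in the first place.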
Lack of Training Data
Many of us were skeptical when we first started out on this journey to create MalwareScore for Mac. Our primary concern was the lack of data. More specifically, we were legitimately concerned about a lack of malicious data, and rightly so. To demonstrate the stark difference in data availability between PE files and Mach-O files, we took three days' worth of data in VirusTotal for both PE files and Mach-O files (Intel architecture) and compared the breakdown of malicious versus benign. We define malicious files as those with at least five detections. This is not a perfect labeling scheme, but it is good enough for a high-level analysis and for demonstration purposes.
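The labeling rule amounts to a one-liner; here it is as a small helper (function name is ours, for illustration) applied to per-sample detection counts:

```python
def malicious_fraction(detection_counts, threshold=5):
    """Fraction of samples labeled malicious under the post's rule:
    at least `threshold` engine detections means malicious."""
    labels = [count >= threshold for count in detection_counts]
    return sum(labels) / len(labels)

# Six hypothetical samples: two cross the five-detection threshold.
print(malicious_fraction([0, 0, 1, 7, 12, 0]))
```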
The two pie charts below demonstrate the general disparity in file types and are scaled to represent the total number of samples for each file type (2.7 million PE files and 94,000 Mach-O files). There are two key conclusions to draw from these charts. First, there are far more files in VirusTotal for Windows than for Mac. Second, there is very little Mac malware: about 46% of the PE files submitted during this time are malicious, while only 1.5% of the Mach-O files are.
The pie chart on the left reflects the malware distribution for PE files, while the smaller one on the right reflects the Mac distribution. The size of the pie charts reflects data availability and the small population problem when classifying Mac malware.
We took several steps to help ameliorate this situation. As is common practice for imbalanced learning problems, we incorporated class weights when training our classifier. However, we discovered that adjusting the class weights to simulate a fifty-fifty class balance was suboptimal in our experiments. Instead, we performed a grid search on class weights and discovered that our problem preferred much stronger weighting of benign files, while still achieving high true positive rates.
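A sketch of that kind of search on synthetic data follows. The model, grid values, and scoring metric are all illustrative assumptions, not MalwareScore's actual configuration; the point is only that class_weight is treated as a hyperparameter to search over rather than fixed at a 50/50 rebalance.

```python
# Hypothetical sketch of a class-weight grid search on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced toy data: ~1.5% positives, echoing the Mach-O skew.
X, y = make_classification(n_samples=4000, weights=[0.985], random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    # Try progressively heavier weighting of the benign (0) class.
    param_grid={"class_weight": [{0: w, 1: 1.0} for w in (1, 2, 5, 10, 20)]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

In practice you would score against the operating point you care about (e.g., true positive rate at a fixed, very low false positive rate) rather than plain ROC AUC.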
We also expanded data sources beyond narrowly scoped or biased sample distributions. As researchers, we must fight the urge to simply grab as much data as we can from VirusTotal and other similarly biased data sources to build our classifiers. This is obviously easier said than done, but it was very important for us to do. Three easy ways to potentially expand data sources include grabbing Mach-O files from a clean, freshly installed macOS, adding known benign open source software, and incorporating customer data. The problem with these solutions is that they do little to help class imbalance, as they heavily favor benign data. Without adding additional and diverse malicious Mach-O files, your classifier is likely to overfit and fail to generalize to new malware. You’ll need to be more creative to make a production-level classifier!
Endgame MalwareScore® for Mac is now live in VirusTotal! This release was the culmination of lots of hard work by many engineers, data scientists, and other researchers. This post outlined several challenges we faced during its development and we hope it will help others as they extend their machine learning AV to support Mac. We are proud of Endgame MalwareScore® and now MalwareScore for Mac, and believe it is important to provide transparency into the range of data challenges we encounter in building product-grade machine learning malware classifiers. In the coming weeks we’ll post additional details on performance!