Customers

Elasticsearch as an Android Malware Research Platform

Our team is responsible for mobile threat research at Trend Micro. Daily work of team members consists of mobile application analysis, virus signature creation, and vulnerability research. Virus signature is the key output which is then integrated with Trend Micro's mobile product to protect our customers from mobile ransomware, banking Trojans, and other types of threats.

As of 31 January 2017, we have collected more than 40 million Android applications and of that 20 million have been judged as malware or PUA (Potentially Unwanted Applications). We sourced tens of thousands of APK (Android application package) files each day. How to quickly and effectively find malware in so many applications is a big challenge to the team.

Beginning in the second half of 2014, we introduced the Elastic Stack into our backend system. ELK and since that point, the Elastic Stack has a played a very important role for us. We use it to:

  1. Analyze cloud scan logs from our products. This allows us to learn about the threat status in real world.
  2. Analyze logs from dozens of backend modules
  3. Monitor OS and application statuses in backend
  4. Create a platform for researchers to hunt malware

The following blog will focus on our efforts concerning #4.

The Architecture

Each Android application will pass a number of analyzers after it enters our system. In turn, each analyzer extracts some attributes from the application. We send these attributes to MongoDB using the hash of the application as the key. Then other modules can fetch them quickly. But if we want to get all the applications which have some specific attributes, MongoDB cannot help here because it is not possible to create index for all the attributes. With Elasticsearch, however, we can easily meet this requirement with all attributes synced and indexed directly into our Elasticsearch 2.4 deployment.

1.png

In each cluster, one document represents one application. We use the SHA256 hash as the document id. Finally, a web interface facilitates the search and aggregation queries against data in these clusters.

The Use Cases

Case 1: Hunt malware with similar package

To introduce this case, let me provide some background information first. Each Android application is assigned a package name. For instance, we can get the Facebook application on Google Play store with this URL: https://play.google.com/store/apps/details?id=com.facebook.katana with "com.facebook.katana"as the package name.

But package name is not exclusive as different applications can use the same package name as long as they are not installed on the same Android device.

With that background in mind, let's look at a specific example of how malware authors can exploit this.

Adobe ceased development on Flash Player for Android  in 2012 and from then on, the official Flash Player application was no longer available on Google Play.

Now, let's dig a little further.

Although Flash player has been abandoned by the Android platform, some mobile users still want it to watch the videos in Flash format. Malware authors take advantage of this and create malicious applications with the same or similar package names as the original Flash Player. Careless mobile users are easily deceived to install these malware apps. To verify this, let us check our cluster with below query.

{
  "size": 0,
  "aggs": {
    "package": {
      "filter": {
        "fuzzy": {
          "PackageName.raw": {
            "value": "com.adobe.flashplayer"
          }
        }
      },
      "aggs": {
        "count": {
          "terms": {
            "field": "PackageName.raw"
          }
        }
      }
    }
  }
}

The fuzzy query is to get all the applications with package name similar to "com.adobe.flashplayer" then terms aggregation categorizes them. The output is very interesting. "con.adobe.flash.player", "com.adobe.flashplyer", "com.Adobes.flashplayer" can you distinguish them?

[
  ...
  {
    "key": "con.adobe.flash.player",
    "doc_count": 6
  },
  {
    "key": "com.adobe.flash.player",
    "doc_count": 4
  },
  {
    "key": "com.adobe.flashplayere",
    "doc_count": 4
  },
  {
    "key": "com.adobe.flashplyer",
    "doc_count": 4
  },
  {
    "key": "com.Adobes.flashplayer",
    "doc_count": 3
  },
  {
    "key": "com.adobe.flashplayer2",
    "doc_count": 3
  },
  {
    "key": "com.adobe.ftashplayer",
    "doc_count": 3
  },
  ...
]

Below is the set of these similar package names.

2.png

Case 2: Hunt malware with key string

Below is a piece of reversed Java code from a malicious application. The code itself is quite straightforward. The malware obtains the phone number with the Android API getLine1Number() and collects all the short text messages in the function call getSmsInPhone() , then send these information to "13679818674@139.com".

3.png

"13679818674@139.com", "phone", "mailMsg:" in above code are const strings of the application. Our analyzer already extracted all these string and indexed them to a single document with Elasticsearch. Now we want to find all the applications using the email account "13679818674@139.com". It is very possible that all of them are malicious and were created by the same bad actor. A simple query can help us determine this.

{
  "fields": [
    "_id"
  ],
  "query": {
    "term": {
      "strings.raw": "13679818674@139.com"
    }
  }
}

Note that during the indexing, we added the not-analyzed field strings.raw.

Case 3: Hunt variants with string similarity

String is an important feature of any application. Variants of one malware family normally share some common strings. For example, the analyzed terms in the above example, "13679818674" & "mailmsg", represent the family (we use the default analyzer during indexing). Based on string similarity, we can find new variants. The MLT (More Like This) query fits this scenario perfectly. For instance, imagine that we want to find all applications that are similar to a malware dd936922f618b425efd0128aae5967eb13460f4444abd23f53ef9db8589bd9cd (this is the sha256 hash of the malware). A MLT query could be generated.

{
  "fields": [
    "_id"
  ],
  "query": {
    "more_like_this": {
      "like": [
        {
          "_index": "string-2017.01",
          "_type": "strings",
          "_id": "dd936922f618b425efd0128aae5967eb13460f4444abd23f53ef9db8589bd9cd"
        }
      ],
      "minimum_should_match": "60%",
      "max_query_terms": 50,
      "min_word_length": 3
    }
  }
}

Parameters like "max_query_terms" and "minimum_should_match" could be adjusted to improve the accuracy. Hit documents with the high score values are the very possible variants.

For instance, "9711534f3c23a6de4a3673c6db1a2fd09389dd2fbe4fb8e2fa715cfa6deb3682" is in the returned list. We compare the code hierarchy with the original one. It is obvious that this is a new variant adding a new class "g0us_sD ".

4.png

5.png

6.png

Note that in this case, we use the analyzed String field. To debug which terms are used for the similarity calculation, we simply use validate API with "rewrite" set to "true".

More …

Above are some simple cases. With the strong capability provided by Elasticsearch, more advanced tasks are accomplished on the platform including:

  1. correlating different attributes to create more accurate sample hunt queries
  2. creating rules to monitor application with specific characteristics
  3. speeding up the validation and release process of our virus signature

Furthermore, we designed other clusters to index the preprocessed application code. This gives us new ways to dig into the behavior of the application. Implementing this whole thing is something we are undertaking right now and can't wait to share in a later post.

As a result of the success we've had in Android application, we have extended the Elastic Stack to handle iOS and Mac application as well. We are also studying the new fantastic features of Elastic Stack and piloting the new version in some additional small systems.


author-kenny-ye.jpg

Kenny Ye
Senior staff engineer in Trend Micro. Technical leader of Mobile Threat Response Team.