August 3, 2017 Engineering

Integrating Elasticsearch with ArcSight SIEM - Part 6 - Detecting Unusual Processes with X-Pack Machine Learning

By Dale McDiarmid, Mike Paquette

Following on from our most recent security alerting post, where we attempted to identify unusual processes using X-Pack Alerting, in this post we explore a more automated approach to the same challenge using machine learning. The earlier post effectively classified any new process as unusual. As highlighted in the closing footnote of that post, that approach is neither an architecturally scalable nor an efficient means of identifying processes of interest: it rapidly leads to alert fatigue unless other detection mechanisms and filtering are put in place, e.g. restricting our query to a subset of servers of interest. In practice, we can do much better.

X-Pack machine learning (ML) capabilities include automated anomaly detection, which allows us to spot occurrences of values of a field that are rare within a set of processes started on a server.  Using the partition field capabilities of Machine Learning, we effectively create an independent statistical model per server.  Our  ML job will use the rare function to automatically gauge the level of rarity of a process simply by observing the relative frequencies of process names for that server over time. Processes that are truly rare will receive a high anomaly score, so rather than alerting on every new process, we can subsequently alert only on rare processes.  
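To make the idea of "rare by process name, partitioned by server" concrete, the following is a toy sketch only. It is not the algorithm used by the ML engine, which builds a statistical model over time; here we simply tally process names per host and flag those whose share of executions falls at or below a cut-off. The function name, the `max_share` threshold and the host names are all hypothetical.

```python
from collections import Counter, defaultdict


def rare_processes(events, max_share=0.01):
    """Return {host: [process, ...]} for processes whose share of that
    host's executions is at or below max_share (a hypothetical cut-off)."""
    counts = defaultdict(Counter)
    for host, process in events:
        counts[host][process] += 1  # one independent tally per host

    rare = {}
    for host, tally in counts.items():
        total = sum(tally.values())
        flagged = sorted(p for p, n in tally.items() if n / total <= max_share)
        if flagged:
            rare[host] = flagged
    return rare


# Illustrative data: "nc" runs once in 200 executions on web-1,
# whereas "mail" is entirely normal on mail-1.
events = [("web-1", "sudo")] * 150 + [("web-1", "vim")] * 49 + [("web-1", "nc")]
events += [("mail-1", "mail")] * 100
print(rare_processes(events))
```

Because each host keeps its own tally, a process that is common on one server can still be flagged as rare on another, which is the effect the partition field gives us in the real job.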

Note that the machine learning engine is able to calculate the absolute probability (or improbability) of an occurrence being “normal,” but these values can be very small and difficult to interpret. To help with comparing the unusualness of observed anomalies, the ML engine also calculates a normalized anomaly score, where anomalies are ranked on a 0-to-100 scale, which is further broken up into quartiles that are color coded for easy visualization. Anomalies scoring in the quartile containing the most unusual observations are referred to as critical anomalies. Users can specify a minimum value of this anomaly score for which to create alerts.
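As a small illustration of the quartile banding, the sketch below maps a normalized score to a band. Only the “critical” label and the 75 threshold come from this post; the names used here for the three lower bands ("warning", "minor", "major") and the function itself are assumptions for the purposes of example.

```python
def severity(score):
    """Map a normalized 0-100 anomaly score to a quartile band.

    Band names other than "critical" are assumed labels for this sketch.
    """
    if not 0 <= score <= 100:
        raise ValueError("anomaly scores are normalized to the range 0-100")
    if score >= 75:
        return "critical"
    if score >= 50:
        return "major"
    if score >= 25:
        return "minor"
    return "warning"


print(severity(78))
```

A minimum alerting score of 75 therefore corresponds to alerting on the top band only.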

The ML Job we utilise below is well documented as an ML recipe here.  Whilst the recipe example utilises auditd logs, captured with Filebeat, we adapt the configuration to the CEF data provided in the previous post.

As a reminder, rare processes running on a host could be an indication of suspicious or malicious behavior. For example, an ftp process observed on a server that never runs ftp could be related to unauthorized access. A rare mimi.exe process on a Windows system could likewise indicate the use of malware to steal credentials.

Data Overview

As a quick recap, our sample CEF data looks like:

CEF:0|Unix|auditd||EXECVE|EXECVE|Low| eventId=30275 externalId=1737 rt=1495907409681 categorySignificance=/Informational categoryBehavior=/Execute/Response categoryDeviceGroup=/Operating System catdt=Operating System categoryOutcome=/Success categoryObject=/Host/Application/Service art=1495990987103 cat=EXECVE c6a4=fe80:0:0:0:5604:a6ff:fe32:b64 cs1Label=dev cs2Label=key cs3Label=success/res cs4Label=syscall cs5Label=subj cs6Label=terminal/tty cn2Label=ses cn3Label=uid c6a4Label=Agent IPv6 Address ahost=test-server-1 agt=192.168.0.12 av=7.3.0.7886.0 atz=Europe/London at=linux_auditd dtz=Europe/London deviceProcessName=auditd ad.argc=3 ad.a1=vim ad.a2=/etc/filebeat/filebeat.yml ad.a0=sudo aid=3xrP+T1wBABCAA5ZTdRz+fA\=\=

Utilising the same auditd data subset as the watch in our previous post, we configure our ML job to process documents where the cat field contains the value EXECVE. The field ad.a0 represents the command issued, whilst ahost identifies the host from which the event originated. Finally, rt provides the start time of the process in epoch milliseconds.

The Logstash CEF codec, used to process this data with Logstash, maps the above fields to their standardised CEF form. The cat field is mapped to deviceEventCategory, rt to deviceReceiptTime, ahost to agentHost and the ad.* fields to ad.argc as a concatenated string.  The latter is subsequently processed with a filter to produce the field ad.a0.
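For readers who want to see the field renaming in isolation, here is a minimal Python sketch of pulling key=value pairs out of a CEF extension and applying the three renames mentioned above. It is not the Logstash CEF codec: the regular expression is a simplification, the function name is hypothetical, and the ad.* keys are kept exactly as they appear in the raw event rather than being concatenated and re-split.

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the line.
_PAIR = re.compile(r"([\w.]+)=(.*?)(?=\s+[\w.]+=|\s*$)")

# Renames described in the text; other keys are left unchanged in this sketch.
RENAMES = {
    "cat": "deviceEventCategory",
    "rt": "deviceReceiptTime",
    "ahost": "agentHost",
}


def parse_cef_extension(line):
    """Return the extension fields of a CEF line as a dict."""
    # A CEF line has seven pipe-delimited header fields, then the extension.
    extension = line.split("|", 7)[7]
    fields = {}
    for key, value in _PAIR.findall(extension):
        # CEF escapes "=" inside values as "\=".
        fields[RENAMES.get(key, key)] = value.replace("\\=", "=")
    return fields


sample = (
    r"CEF:0|Unix|auditd||EXECVE|EXECVE|Low| eventId=30275 rt=1495907409681 "
    r"cat=EXECVE catdt=Operating System ahost=test-server-1 ad.argc=3 "
    r"ad.a1=vim ad.a0=sudo aid=3xrP+T1wBABCAA5ZTdRz+fA\=\="
)
event = parse_cef_extension(sample)
print(event["deviceEventCategory"], event["agentHost"], event["ad.a0"])
```

The sample line is a shortened version of the event shown earlier; values containing spaces, such as catdt, are kept whole because a new pair only starts where a key is immediately followed by "=".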

Compared to the watch-based approach, ML requires larger datasets with which to build its statistical models.  We therefore provide a larger test set than the one we distributed with the previous post.  Fictitious anomalous processes have been injected into this dataset for purposes of example.  Instructions for obtaining this dataset, along with required Logstash configuration to ingest, can be found here.

Job Configuration

After downloading the files associated with the recipe, we need to make a few minor modifications to utilise the fields above.

Datafeed configuration

{
  "datafeed_id": "datafeed-unusual_process",
  "job_id": "unusual_process",
  "query": {
    "term": {
      "deviceEventCategory": {
        "value": "EXECVE"
      }
    }
  },
  "query_delay": "60s",
  "frequency": "300s",
  "scroll_size": 1000,
  "indexes": [
    "cef-auditd-*"
  ],
  "types": [
    "doc"
  ]
}

Job configuration

{
  "job_id": "unusual_process",
  "description": "unusual process",
  "analysis_config": {
    "bucket_span": "10m",
    "influencers": [
      "ad.a0",
      "agentHost"
    ],
    "detectors": [
      {
        "function": "rare",
        "by_field_name": "ad.a0",
        "partition_field_name": "agentHost"
      }
    ]
  },
  "data_description": {
    "time_field": "deviceReceiptTime",
    "time_format": "epoch_ms"
  },
  "custom_settings": {
    "custom_urls": [
      {
        "url_name": "Explore Process on Server",
        "url_value": "http://localhost:5601/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'$earliest$',mode:absolute,to:'$latest$'))&_a=(columns:!(_source),index:'cef-auditd-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'ad.a0:$ad.a0$ AND agentHost:$agentHost$')),sort:!('@timestamp',desc))"
      }
    ]
  },
  "model_plot_config": {
      "enabled" : true
  }
}

Notice we no longer need to create a “process_signature” field. Rather than simply looking for new processes, we configure the ML job to use the field “ad.a0”, which describes the main process name. By choosing to partition on agentHost, we consider the process names with respect to each host independently - thus avoiding the need to create a concatenated field.

Note that the custom_url setting in the ML job allows us to attach a url to each of our detected anomalies for subsequent investigation. The $field$ syntax allows values from the anomaly to be injected into the url, thus producing a unique destination.  Here we link to Kibana’s discover view, filtering the search to the influential server and process. The $earliest$ and $latest$ tokens are used to pass the time span of the selected anomaly to the target page. The tokens will be substituted with date-time Strings in ISO-8601 format, e.g. 2016-02-08T00:00:00.000Z, as used by Kibana when restricting to a time range. Whilst here we direct the user to the discover view, a dashboard or graph may be a more appropriate destination for other ML jobs.
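The token substitution can be pictured with the short Python sketch below. This is an illustration of the idea rather than the code Kibana runs: the function name is hypothetical, the URL is abbreviated, the anomaly values are illustrative, and any URL encoding the real implementation may perform is ignored.

```python
import re


def fill_custom_url(template, anomaly):
    """Replace $name$ tokens in template with values from the anomaly dict.

    Tokens with no matching key are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return str(anomaly[name]) if name in anomaly else match.group(0)

    return re.sub(r"\$([\w.]+)\$", substitute, template)


# Abbreviated form of the url_value in the job configuration.
template = (
    "discover?time=(from:'$earliest$',to:'$latest$')"
    "&query='ad.a0:$ad.a0$ AND agentHost:$agentHost$'"
)
anomaly = {
    "earliest": "2017-06-12T07:30:00.000Z",
    "latest": "2017-06-12T07:40:00.000Z",
    "ad.a0": "mail",
    "agentHost": "test-server-1",
}
print(fill_custom_url(template, anomaly))
```

Each anomaly therefore yields its own link, scoped to the process, the server and the time span in which the anomaly occurred.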

Our detector is configured as follows:

1.png

As shown above, we look for rare occurrences in the field “ad.a0”, partitioning by agentHost. These fields are also selected as influencers. When each detected anomaly is assigned a normalized Anomaly Score, it will also be annotated with values of other fields in the data that have statistical influence on the anomaly.  Here we specify two candidate influencers - our host and process names.

The ML job configuration files can be loaded via the script provided.  Alternatively, the same job can be created through the UI, once the data is ingested.  Here we walk through the ML job configuration using the Machine Learning user interface within Kibana. The more detailed data ingestion steps can be found here.

  • 1. From the Machine Learning Job Listing, select “Create New Job”. Select “Create an advanced job”

2.png

  • 2. Complete the Index, Types and Time-field selection with the values “cef-auditd-*”, “doc” and “deviceReceiptTime” as shown below.

3.png

  • 3. Complete the “Job Details,” before selecting “Analysis Configuration”

4.png

  • 4. Specify a 10m bucket span on the “Analysis Configuration” panel, and select the fields “ad.a0” and “agentHost” as influencers.

5.png

  • 5. Select “Add Detector” and enter the configuration shown below.  This completes our configuration and provides the core of the job configuration described above.

6.png

  • 6. Save the Job, selecting “Close” when prompted to run the job.

Run the Job

Irrespective of whether you have created the job through the script or via the UI, execute the analysis by following the steps below.

  • 1. Click the play icon to schedule the job

7.png

  • 2. Select all defaults on the following screen, clicking “Start”.

Examine the Results

Once the job has run, we can navigate to the results. Note that although the dataset contains over 200k documents, our targeted query means we only need to process around 30k. This “targeting” of subsets of data with a query represents an easy way to optimise the performance of ML jobs through the fast search capabilities of Elasticsearch.

Our results highlight several anomalies for the 2 week period.

8.png

By selecting the influencer “ad.a0” from the “View by” dropdown, we are able to rapidly identify those processes which have been identified as being rare for each specific server.  The same rare processes are highlighted across the full period of the dataset - principally “nc”, “wget”, and “mail”.  The running of a mail process is highlighted as unusual for the server on which it was observed. Of course, if you were running this job on your email servers, “mail” would certainly not show up as rare!  

To further investigate the rare occurrence of a process on a server, we can take advantage of the custom_url included in the job configuration details above (not shown in the UI-based example).  To achieve this, select an anomaly from the list and click the provided link “Explore Process on Server” as shown below.

9.png

This links us to the Discover view, where we are able to see the “mail” process sending emails to the “vodkaroom.ru” domain - obviously this isn’t a risky domain in reality and has been added for purposes of example!

10.png

Note that we only have a single server in our small dataset for the purposes of example. In reality this job would be used to monitor a much larger set of servers.

Alerting on the Results

Whilst this anomaly detection is extremely valuable, and far more efficient and selective than the previous rule-based approach, we have not yet shown how to fully automate alerting on detected anomalies. Without this automation, we would require periodic inspection of the swimlane visuals to check for anomalies - which is error-prone and unscalable in practice, especially as we add more ML jobs.

In order to be alerted to unusual processes, we can take advantage of the anomaly results indexed by our ML job.  This job will execute periodically, indexing its results into an .ml-anomalies-shared index.  More specifically it indexes 3 "levels" of results, that can be queried via the result_type field:

  • “bucket” level - Answers: How unusual was the job in a particular bucket of time? Essentially an aggregated anomaly score – useful for rate-limited alerts
  • “record” level - Answers: What individual anomalies are present in a range of time? All the detailed anomaly information, but records can be numerous in big data
  • “influencer” level - Answers: What are the most unusual entities in a range of time?
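As a rough sketch of how each level might be queried, the Python helper below builds a search body that filters on job_id, result_type and a minimum score. The helper name is hypothetical; the score field names (anomaly_score, record_score, influencer_score) are those used by the watch later in this post. The body would be sent to the .ml-anomalies-* indices with whichever Elasticsearch client you prefer.

```python
# Score field used by each result level, as referenced by the watch in this post.
SCORE_FIELDS = {
    "bucket": "anomaly_score",
    "record": "record_score",
    "influencer": "influencer_score",
}


def results_query(job_id, result_type, min_score=0.0):
    """Build a search body selecting one level of results for a job."""
    score_field = SCORE_FIELDS[result_type]
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"job_id": job_id}},
                    {"term": {"result_type": result_type}},
                    {"range": {score_field: {"gte": min_score}}},
                ]
            }
        }
    }


# Critical buckets for the job configured earlier.
print(results_query("unusual_process", "bucket", min_score=75))
```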

As described here, we can use the above information with X-Pack Alerting to proactively alert when a high scoring time bucket is detected by the job - attaching information as to the likely anomalies responsible.  This is appropriate for this specific job, as in practice it is still likely to generate a large number of low scoring anomalies. No single anomaly is likely to be the “magic bullet” to detect suspicious activity - we are therefore interested in buckets with an aggregated score of critical.  Our Anomaly Explorer shows a single critical bucket that we’d expect to be alerted on. Below we adapt the watch used by Machine Learning in version 5.5 - notice how the query restricts to buckets with a score >= 75.

{
 "trigger": {
   "schedule": {
     "interval": "5m"
   }
 },
 "metadata": {
   "min_anomaly_score": 75,
   "time_period":"20m"
 },
 "input": {
   "search": {
     "request": {
       "search_type": "query_then_fetch",
       "indices": [
         ".ml-anomalies-*"
       ],
       "types": [],
       "body": {
         "size": 0,
         "query": {
           "bool": {
             "filter": [
               {
                 "term": {
                   "job_id": "unusual_process"
                 }
               },
               {
                 "range": {
                   "timestamp": {
                     "gte": "{{ctx.trigger.scheduled_time}}||-{{ctx.metadata.time_period}}"
                   }
                 }
               },
               {
                 "terms": {
                   "result_type": [
                     "bucket",
                     "record",
                     "influencer"
                   ]
                 }
               }
             ]
           }
         },
         "aggs": {
           "bucket_results": {
             "filter": {
               "range": {
                 "anomaly_score": {
                   "gte": 75
                 }
               }
             },
             "aggs": {
               "top_bucket_hits": {
                 "top_hits": {
                   "sort": [
                     {
                       "anomaly_score": {
                         "order": "desc"
                       }
                     }
                   ],
                   "_source": {
                     "includes": [
                       "job_id",
                       "result_type",
                       "timestamp",
                       "anomaly_score",
                       "is_interim"
                     ]
                   },
                   "size": 1,
                   "script_fields": {
                     "start": {
                       "script": {
                         "lang": "painless",
                         "inline": "new Date(doc[\"timestamp\"].date.getMillis()-doc[\"bucket_span\"].value * 1000 * params.padding)",
                         "params": {
                           "padding": 10
                         }
                       }
                     },
                     "end": {
                       "script": {
                         "lang": "painless",
                         "inline": "new Date(doc[\"timestamp\"].date.getMillis()+doc[\"bucket_span\"].value * 1000 * params.padding)",
                         "params": {
                           "padding": 10
                         }
                       }
                     },
                     "timestamp_epoch": {
                       "script": {
                         "lang": "painless",
                         "inline": "doc[\"timestamp\"].date.getMillis()/1000"
                       }
                     },
                     "timestamp_iso8601": {
                       "script": {
                         "lang": "painless",
                         "inline": "doc[\"timestamp\"].date"
                       }
                     },
                     "score": {
                       "script": {
                         "lang": "painless",
                         "inline": "Math.round(doc[\"anomaly_score\"].value)"
                       }
                     }
                   }
                 }
               }
             }
           },
           "influencer_results": {
             "filter": {
               "range": {
                 "influencer_score": {
                   "gte": 3
                 }
               }
             },
             "aggs": {
               "top_influencer_hits": {
                 "top_hits": {
                   "sort": [
                     {
                       "influencer_score": {
                         "order": "desc"
                       }
                     }
                   ],
                   "_source": {
                     "includes": [
                       "result_type",
                       "timestamp",
                       "influencer_field_name",
                       "influencer_field_value",
                       "influencer_score",
                       "is_interim"
                     ]
                   },
                   "size": 3,
                   "script_fields": {
                     "score": {
                       "script": {
                         "lang": "painless",
                         "inline": "Math.round(doc[\"influencer_score\"].value)"
                       }
                     }
                   }
                 }
               }
             }
           },
           "record_results": {
             "filter": {
               "range": {
                 "record_score": {
                   "gte": 3
                 }
               }
             },
             "aggs": {
               "top_record_hits": {
                 "top_hits": {
                   "sort": [
                     {
                       "record_score": {
                         "order": "desc"
                       }
                     }
                   ],
                   "_source": {
                     "includes": [
                       "result_type",
                       "timestamp",
                       "record_score",
                       "is_interim",
                       "function",
                       "field_name",
                       "by_field_value",
                       "over_field_value",
                       "partition_field_value"
                     ]
                   },
                   "size": 3,
                   "script_fields": {
                     "score": {
                       "script": {
                         "lang": "painless",
                         "inline": "Math.round(doc[\"record_score\"].value)"
                       }
                     }
                   }
                 }
               }
             }
           }
         }
       }
     }
   }
 },
 "condition": {
   "compare": {
     "ctx.payload.aggregations.bucket_results.doc_count": {
       "gt": 0
     }
   }
 },
 "actions": {
   "log": {
     "logging": {
       "level": "info",
       "text": "Alert for job [{{ctx.payload.aggregations.bucket_results.top_bucket_hits.hits.hits.0._source.job_id}}] at [{{ctx.payload.aggregations.bucket_results.top_bucket_hits.hits.hits.0.fields.timestamp_iso8601.0}}] score [{{ctx.payload.aggregations.bucket_results.top_bucket_hits.hits.hits.0.fields.score.0}}]"
     }
   }
 }
}

Check out “Alerting on Machine Learning Jobs in Elasticsearch v5.5” for further details, specifically the insights detailed in the Advanced section. For the purposes of example, we simply log to Elasticsearch that an event has occurred. In practice this alert would utilise a more elaborate action, such as email, to inform the appropriate administrator. Applying this watch to our static dataset requires us to execute the watch as a sliding window, gradually covering each time period. We reuse the script from earlier posts to achieve this. Executing this watch as described here results in the watch detecting the single critical anomaly shown earlier:
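The sliding-window idea can be sketched in Python as follows. This is not the script from the earlier posts: it is a hypothetical helper that steps a scheduled time across a period at the watch's 5m interval and reports the lower bound that the watch's 20m range filter would apply at each step. Each scheduled time could then be supplied when executing the watch manually.

```python
from datetime import datetime, timedelta, timezone


def sliding_triggers(start, end,
                     interval=timedelta(minutes=5),
                     time_period=timedelta(minutes=20)):
    """Yield (scheduled_time, lower_bound) pairs covering start..end.

    interval and time_period default to the values in the watch above.
    """
    scheduled = start
    while scheduled <= end:
        yield scheduled, scheduled - time_period
        scheduled += interval


# Illustrative 15-minute span around the alert time logged in this post.
start = datetime(2017, 6, 12, 7, 30, tzinfo=timezone.utc)
end = start + timedelta(minutes=15)
for scheduled, lower in sliding_triggers(start, end):
    print(scheduled.isoformat(), "searches from", lower.isoformat())
```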

[2017-07-26T14:55:32,749][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [6S5dxZK] Alert for job [unusual_process] at [2017-06-12T07:30:00.000Z] score [78]

In reality, our alerting criteria are likely to be more complex. Whilst a period of unusual processes might be suspicious, in very large infrastructures this alone is unlikely to represent an event worthy of a security analyst’s attention. However, if coupled with other signatures (e.g., unusual logins), further inspection is likely warranted. The above watch could therefore be tailored to look for multiple ML-detected anomalies, or even apply further static rules.

Normally, the above watch would execute periodically, matching on documents generated by the ML job in real time. Given the static nature of our dataset, we did not do this here, instead executing the watch as a sliding window as described above.

Summary

In this post, we’ve seen how X-Pack machine learning jobs can be used to automate the analysis of a security-related dataset to detect unusual activity that may be associated with unauthorized access or an actual cyber attack. The key advantage here is that we accomplished this analysis without the complex watch described in our previous post, requiring only a simple watch to notify us when critical anomalies have been detected. Using X-Pack machine learning allows our security analysis to scale. Imagine running 10-20 ML jobs, each working for you to detect unusual activity in the various types of security log data you have in Elasticsearch!