11 September 2018 Engineering

Temporal vs. Population Analysis in Elastic Machine Learning

By Rich Collier

Elastic machine learning allows users to discover two major flavors of anomalies: those that are temporal in nature (unusual with respect to an entity's own history over time) and those that are population-based (unusual with respect to the entity's peers). But what are the differences between these two types, and under what circumstances would you use one kind over the other? This blog discusses the details behind the analyses, their merits, and best practices based upon common rules of thumb.

First of all, let’s define what we mean by temporal and population-based anomalies. To do this, let’s lay out some facts:

Temporal Anomaly Detection

  • Is the default behavior of Elastic machine learning, unless you specifically define the analysis to be population-based
  • Is a comparison of the behavior of an entity with respect to itself over time
  • Can leverage a “split” (via by_field_name or partition_field_name) to create individual baselines for each instance of something (i.e. a unique baseline per host or user)
  • Is not well suited to high-cardinality or very sparse data elements (for example, external IP addresses of visitors to a website may be both sparse and high-cardinality, in the hundreds of thousands or more). This will cause performance problems unless the job's memory limit is increased from the default. Even then, it is likely that population analysis would still be a more suitable approach.

Population Anomaly Detection

  • Is only invoked when the over_field_name is used (or the Population job wizard in Kibana is used). The field declared as the over_field_name is what defines the population.
  • Is a peer analysis of members within a population — or, more accurately — a comparison of an individual entity against a collective model of all peers as witnessed over time.
  • Can also leverage a “split” (via by_field_name or partition_field_name) to create sub-populations
  • Is usually quite suitable for high-cardinality or sparse data elements, because individual members contribute towards a collective model of behavior.
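
As a sketch of that last "split" point, a hypothetical detector could combine over_field_name with a partition_field_name (the department field here is an assumed example, not something from the original log):

    "detectors": [
      {
        "function": "count",
        "over_field_name": "user",
        "partition_field_name": "department"
      }
    ]

This would compare each user against the collective behavior of users in the same department, rather than against the company as a whole.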

Example Use Case

Now that we've got that out of the way, let's discuss a hypothetical use case that can help us decide which approach is the right one to choose.

Imagine we would like to track the number of documents downloaded from an internal document server and let’s assume that the document server contains documents that are broadly applicable to everyone in the company (a large company with 50,000+ employees). Additionally, anyone can access this document server at any time. Finally, let’s also assume that the document server’s access log tracks every time a document is accessed, as well as the user making the download.

Temporal Anomaly Example

If I choose temporal anomaly detection and structure an analysis that resembles:

    "detectors": [
      {
        "function": "count",
        "partition_field_name": "user"
      }
    ]

Then, I will expect to track the volume of downloads, for each user, by name, individually over time. Therefore, if user=”Wilma” typically downloads around 50 documents per day (she’s in Engineering and is a heavy user of the documents on the server), it would only be anomalous if Wilma drastically changed her behavior and downloaded far fewer or many more than she usually does (like 5,000 documents per day).

Also, as a consideration mentioned earlier, I would need to be aware of how many distinct users I’m expecting ML to keep track of. At 50,000 unique users, it might be manageable if we increased the memory limit on the ML job. However, another thing to keep in mind is that a user may not consistently download documents every single day — so their activity may be quite sparse — they may download a few today, then not again until months from now. It might be hard to establish an accurate baseline per user if there aren’t consistent observations available per user.
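
If the temporal approach were still desired at that scale, the job's memory ceiling can be raised via the analysis_limits section of the job configuration. This is just a sketch; the 1gb value is illustrative rather than a recommendation:

    "analysis_limits": {
      "model_memory_limit": "1gb"
    }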

But, most importantly, what if a new person (user=”Fred”) comes along and downloads 5,000 documents in one day and never does it again? Was this unusual? Is Fred an insider threat, or did some piece of data exfiltration malware just use Fred's credentials to steal a bunch of documents with the hopes of transmitting them outside of the organization? When analyzed with temporal anomaly detection, the answer is indeterminate: we don't really know whether it is unusual for “Fred”, since we don't have any history for him to compare against (we've only seen one sample).

Population Anomaly Example

Alternatively, if we frame the problem using population anomaly detection, such as:

    "detectors": [
      {
        "function": "count",
        "over_field_name": "user"
      }
    ]

Then, we can accomplish the following:

  • We can use whatever users happen to be present in each bucket_span (in this case, a day) to build a collective model of the typical download count across users
  • We can then compare each user seen in each day against that collective model
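
Putting those two points together, the relevant portion of the job configuration might look like the following (assuming the one-day bucket_span mentioned above):

    "analysis_config": {
      "bucket_span": "1d",
      "detectors": [
        {
          "function": "count",
          "over_field_name": "user"
        }
      ]
    }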

Now, if Wilma and the majority of her cohorts download 10-75 documents per day, the overall “collective” model of typical usage will be in that range. If Fred comes along one day and downloads 5,000 documents, it will be anomalous because Fred is being compared against Wilma and all her cohorts. Fred will be highlighted as an outlier.

There is a key point, however. If Wilma, rather than Fred, happens to be the one who downloads the 5,000 documents, then Wilma will be the one flagged as an anomalous outlier: not specifically because she's downloaded more than she usually does, but because she's downloaded more than the "typical" model of usage that she herself, along with her cohorts, helped establish over time.

The point is, Wilma is highlighted as an outlier in this situation regardless of the choice of Temporal or Population detection, but the Population approach is more flexible in this case because it:

  1. Is impervious to sparse data
  2. Is more efficient, memory-wise, when the number of distinct users gets high (because individual models per user aren’t necessary)
  3. Finds outliers like Fred that have little or no history
  4. Still sees Wilma as an outlier if her behavior changes drastically to the point where it is outside the collective behavior

One additional tip: if you're doing population analysis, it makes the most sense to construct your populations to be as homogeneous as possible. In other words, if you have major clusters of user types (e.g. engineers vs. HR staff), it might be more effective to split those users into separate groups and run two individual analyses than to expect them both to be lumped together.
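
One way to achieve such a split, sketched here with a hypothetical department field, is to run two separate jobs whose datafeeds each select one group using an ordinary Elasticsearch query, for example:

    "query": {
      "term": {
        "department": "engineering"
      }
    }

A second job would use the same detector configuration with a datafeed query matching the other group.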

And one last note: if there are members of a population that you know in advance you'd like to exclude, then consider either filtering them out of the input data or using rules to filter out their results.
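
As a rough sketch of the rules option (the excluded_users filter ID is hypothetical, and detector rules require a recent version of the stack), a detector can reference a custom rule that skips results for any user appearing in a pre-defined filter:

    "custom_rules": [
      {
        "actions": ["skip_result"],
        "scope": {
          "user": {
            "filter_id": "excluded_users",
            "filter_type": "include"
          }
        }
      }
    ]

Here, filter_type: include means the rule applies to, and therefore suppresses results for, the users listed in the excluded_users filter.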

Hopefully, this explanation was helpful in comparing the two approaches. If you’d like to try machine learning on your data, download the Elastic Stack and enable the 30-day trial license, or start a free trial on Elastic Cloud.