Temporal vs. Population Analysis in Elastic Machine Learning

Elastic machine learning allows users to discover two major flavors of anomalies: those that are temporal in nature (with respect to time) and those that are population-based (with respect to all others). But what are the differences between these two types, and under what circumstances would you use one over the other? This blog discusses the details behind the analyses, their merits, and best practices based on common rules of thumb.

First of all, let's define what we mean by temporal and population-based anomalies. To do this, let's lay out some facts:

Temporal Anomaly Detection

  • Is the default behavior of Elastic machine learning, unless you specifically define the analysis to be population-based
  • Is a comparison of the behavior of an entity with respect to itself over time
  • Can leverage a split (via by_field_name or partition_field_name) to create individual baselines for each instance of something (e.g. a unique baseline per host or user)
  • Is not really suitable for high-cardinality or very sparse data elements (for example, external IP addresses of visitors to a website may be both sparse and have a cardinality in the hundreds of thousands or more). Such data will cause performance problems if the job's memory limit isn't increased from the default. Even if that were done, however, population analysis is likely still the more suitable approach.

Population Anomaly Detection

  • Is only invoked when the over_field_name is used (or the Population job wizard in Kibana is used). The field declared as the over_field_name is what defines the population.
  • Is a peer analysis of members within a population or, more accurately, a comparison of an individual entity against a collective model of all peers as witnessed over time
  • Can also leverage a split (via by_field_name or partition_field_name) to create sub-populations
  • Usually quite suitable for high cardinality or sparse data elements because individual members contribute towards a collective model of behavior.

Example Use Case

Now that we've got that out of the way, let's discuss a hypothetical use case that might help us decide which approach is the right one to choose.

Imagine we would like to track the number of documents downloaded from an internal document server, and let's assume that the server contains documents that are broadly applicable to everyone in the company (a large company with 50,000+ employees). Additionally, anyone can access this document server at any time. Finally, let's also assume that the document server's access log records every document access, along with the user making the download.

Temporal Anomaly Example

If I choose temporal anomaly detection and structure an analysis that resembles:

    "detectors": [
      {
        "function": "count",
        "partition_field_name": "user"
      }
    ]

Then I will expect to track the volume of downloads for each user, by name, individually over time. Therefore, if user=Wilma typically downloads around 50 documents per day (she's in Engineering and is a heavy user of the documents on the server), it would only be anomalous if Wilma drastically changed her behavior and downloaded far fewer or many more documents than she usually does (like 5,000 documents per day).

Also, as mentioned earlier, I would need to be aware of how many distinct users I'm expecting ML to keep track of. At 50,000 unique users, it might be manageable if we increased the memory limit on the ML job. However, another thing to keep in mind is that a user may not download documents every single day, so their activity may be quite sparse: they may download a few today, then not again until months from now. It might be hard to establish an accurate baseline per user if there aren't consistent observations available for that user.
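
As a rough illustration of the memory consideration, the per-user detector could be embedded in a full job body with a raised memory limit. This is only a sketch: analysis_limits.model_memory_limit is the job setting that controls the limit, but the "2048mb" value, the "timestamp" time field, and the helper name are assumptions chosen for illustration.

```python
import json


def build_temporal_job(memory_limit: str = "2048mb") -> dict:
    """Return a request body for a per-user temporal job (illustrative values)."""
    return {
        "analysis_config": {
            "bucket_span": "1d",  # one bucket per day, as in the example
            "detectors": [
                {"function": "count", "partition_field_name": "user"}
            ],
        },
        # Assumed value; size this from the job's observed memory usage.
        "analysis_limits": {"model_memory_limit": memory_limit},
        # Assumed name of the timestamp field in the access log.
        "data_description": {"time_field": "timestamp"},
    }


if __name__ == "__main__":
    print(json.dumps(build_temporal_job(), indent=2))
```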

But, most importantly, what if a new person (user=Fred) comes along, downloads 5,000 documents in one day, and never does it again? Was this unusual? Is Fred an insider threat, or did some piece of data exfiltration malware just use Fred's credentials to steal a bunch of documents in the hope of transmitting them outside of the organization? When analyzed with temporal anomaly detection, the answer is indeterminate: we don't really know if it is unusual for Fred, since we don't have any history for him to compare against (we've only seen one sample).
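
The difference between Wilma and Fred can be sketched in a few lines of Python. This is a deliberately simplified stand-in (a z-score against each user's own history), not the model that Elastic machine learning actually uses; the function name, the min_history cutoff, and the sample counts are all made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Optional


def temporal_score(history: List[int], today: int, min_history: int = 7) -> Optional[float]:
    """Score today's count against the same user's own past counts.

    Returns None when there is too little history to form a baseline.
    """
    if len(history) < max(min_history, 2):
        return None  # indeterminate: nothing to compare against
    spread = stdev(history) or 1.0  # avoid dividing by zero on flat history
    return abs(today - mean(history)) / spread


wilma_history = [48, 52, 50, 47, 55, 51, 49, 53]  # hypothetical daily counts
fred_history = []  # Fred has never been seen before

wilma_score = temporal_score(wilma_history, 5000)  # very large: clearly unusual for Wilma
fred_score = temporal_score(fred_history, 5000)    # None: no baseline exists for Fred
```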

Population Anomaly Example

Alternatively, if we frame the problem using population anomaly detection, such as:

    "detectors": [
      {
        "function": "count",
        "over_field_name": "user"
      }
    ]

Then, we can accomplish the following:

  • We can use whatever users happen to be present in each bucket_span (in this case, a day) to build a collective model of what's a typical download count for users
  • We can compare each user seen in each day against that collective model

Now, if Wilma and the majority of her cohorts download 10-75 documents per day, the overall collective model of typical usage will be in that range. If Fred comes along one day and downloads 5,000 documents, it will be anomalous because Fred is being compared against Wilma and all her cohorts. Fred will be highlighted as an outlier.
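
The same scenario can be sketched from the population side. Again, this is a toy: it scores each user against the median and median absolute deviation of everyone seen in a single day, whereas the real analysis builds its collective model over time. The names, threshold, and counts are illustrative assumptions.

```python
from statistics import median
from typing import Dict, List


def population_outliers(counts_by_user: Dict[str, int], threshold: float = 10.0) -> List[str]:
    """Return users whose count is far from the population's typical count."""
    values = list(counts_by_user.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(v - center) for v in values) or 1.0
    return [
        user
        for user, count in counts_by_user.items()
        if abs(count - center) / mad > threshold
    ]


# Hypothetical download counts for one daily bucket.
day = {"wilma": 50, "barney": 12, "betty": 74, "pebbles": 33, "fred": 5000}
outliers = population_outliers(day)  # only "fred" stands out from the group
```

Note that Fred needs no history of his own here; one day of data from his peers is enough to make him stand out.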

There is a key point, however. If Wilma, instead of Fred, happens to download the 5,000 documents, Wilma will be the one flagged as an anomalous outlier, not specifically because she's downloaded more than she usually does, but because she's downloaded more than the typical model of usage that she herself helped establish over time, along with her cohorts.

The point is, Wilma is highlighted as an outlier in this situation regardless of the choice of Temporal or Population detection, but the Population approach is more flexible in this case because it:

  1. Is impervious to sparse data
  2. Is more efficient, memory-wise, when the number of distinct users gets high (because individual models per user aren't necessary)
  3. Finds outliers like Fred that have little or no history
  4. Still sees Wilma as an outlier if her behavior changes drastically to the point where it is outside the collective behavior

One additional tip: if you're doing population analysis, it makes the most sense to construct your populations to be as homogeneous as possible. In other words, if you have major clusters of user types (e.g. Engineers vs. HR people), it might be more effective to split those users into separate groups and run two individual analyses than to lump them together.
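
One way to express such a split within a single job is to combine over_field_name with partition_field_name, which creates sub-populations as noted earlier. The sketch below assumes each log record carries a "department" field; that field is hypothetical and not part of the example above.

```python
import json

# Detector that models users as a population within each department.
# "department" is a hypothetical field name used only for illustration.
detector = {
    "function": "count",
    "over_field_name": "user",
    "partition_field_name": "department",
}

analysis_config = {"bucket_span": "1d", "detectors": [detector]}
print(json.dumps(analysis_config, indent=2))
```

With this shape, an engineer is compared only against other engineers, and an HR user only against other HR users.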

And one last note: if you know in advance that you'd like to exclude certain members of a population, consider either filtering them out of the input data or using rules to filter them out of the results.
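
For the first option, filtering the input data, a minimal sketch is a query that drops known users before the data reaches the job. The query below is ordinary Elasticsearch query DSL and could be supplied as the query of the job's datafeed; the helper name and the user names are placeholders, and it assumes "user" is indexed as an exact-match (keyword) field.

```python
import json
from typing import Dict, List


def exclusion_query(excluded_users: List[str]) -> Dict:
    """Build a bool query that drops documents from the given users."""
    return {
        "bool": {
            "must_not": [
                {"terms": {"user": excluded_users}}
            ]
        }
    }


# Placeholder accounts, e.g. services that legitimately download everything.
query = exclusion_query(["backup-service", "indexing-bot"])
print(json.dumps(query, indent=2))
```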

Hopefully, this explanation was helpful in comparing the two approaches. If you'd like to try machine learning on your data, download the Elastic Stack and enable the 30-day trial license, or start a free trial on Elastic Cloud.