## Outlier detection

Outlier detection is an analysis for identifying data points (outliers) whose feature values are different from those of the normal data points in a particular data set. Outliers may denote errors or unusual behavior.

We use unsupervised outlier detection, which means there is no need to provide a training data set to teach the analysis to recognize outliers. Unsupervised outlier detection uses various machine learning techniques to find which data points are unusual compared to the majority of the data points.

In the Elastic Stack, we use an ensemble of four different distance-based and
density-based outlier detection methods. By default, you don’t need to select
the methods or provide any parameters, but you can override the default
behavior if you like.

The basic assumption of the **distance-based methods** is that normal data
points – in other words, points that are not outliers – have a lot of neighbors
nearby, because we expect that in a population the majority of the data points
have similar feature values. The minority of the data points – the
outliers – have different feature values and are, therefore, far away from
the normal points.
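
As an illustration of overriding the defaults, the sketch below builds a request body that selects a single method and a neighbor count. It assumes the body is sent to the create data frame analytics jobs API and that the option names are `method` and `n_neighbors`; the index names and the value `20` are hypothetical, so check the API reference for your version before relying on them.

```python
import json

# Hypothetical index names, used only for illustration.
job_body = {
    "source": {"index": "my-source-index"},
    "dest": {"index": "my-outlier-results"},
    "analysis": {
        "outlier_detection": {
            # Assumed option names; one of lof, ldof, distance_kth_nn, distance_knn.
            "method": "lof",
            # Assumed value for K, the number of neighbors considered.
            "n_neighbors": 20,
        }
    },
}

# Serialize the body so it can be sent with any HTTP client.
request_json = json.dumps(job_body, indent=2)
print(request_json)
```

Leaving out the `outlier_detection` options altogether corresponds to the default behavior described above, where the methods and parameters are chosen for you.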

The distance of K-th nearest neighbor method (`distance_kth_nn`) computes the
distance of the data point to its K-th nearest neighbor, where K is a small
number and usually independent of the total number of data points. The higher
this distance, the more the data point is an outlier.
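
The idea can be sketched in a few lines of plain Python. This is a minimal brute-force illustration that assumes Euclidean distance; the function name `kth_nn_distance` and the sample points are hypothetical, and it is not the implementation used in the Elastic Stack.

```python
import math

def kth_nn_distance(points, k):
    """Return, for each point, the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points in a tight cluster and one point far away (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = kth_nn_distance(points, k=2)
most_outlying = scores.index(max(scores))
```

With this data, the clustered points all score 1.0, while the distant point at index 4 has by far the largest score.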

The distance of K-nearest neighbors method (`distance_knn`) calculates the
average distance of each data point to its K nearest neighbors. Points with the
largest average distance are the most outlying.
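
A minimal sketch of this averaging, again assuming Euclidean distance and a brute-force neighbor search. The function name `knn_mean_distance` and the sample points are hypothetical.

```python
import math

def knn_mean_distance(points, k):
    """Return, for each point, the mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four points in a tight cluster and one point far away (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_mean_distance(points, k=2)
most_outlying = scores.index(max(scores))
```

Averaging over K neighbors makes the score less sensitive to a single unusually close or distant neighbor than using only the K-th distance.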

While the results of the distance-based methods are easy to interpret, their
drawback is that they don’t take into account the density variations of a
data set. **Density-based methods** mitigate this problem. These methods take into
account not only the distance of the points to their K nearest neighbors but
also the distance of these neighbors to their neighbors.

Based on this approach, a metric called the local outlier factor
(`lof`) is computed for each data point. The higher the local outlier factor,
the more outlying the data point is.
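
The sketch below follows the commonly published definition of the local outlier factor (local reachability density of a point compared with that of its neighbors). It is an assumption that this matches the general approach described here; the details of the Elastic Stack implementation may differ. The function name and sample data are hypothetical, and the code assumes Euclidean distance and no duplicate points.

```python
import math

def local_outlier_factor(points, k):
    """Brute-force local outlier factor; assumes no duplicate points."""
    n = len(points)
    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append([j for _, j in ranked[:k]])
        k_distance.append(ranked[k - 1][0])

    def reach_dist(i, j):
        # Reachability distance of point i from point j.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / k / lrd[i] for i in range(n)]

# Four points in a tight cluster and one point far away (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
lof = local_outlier_factor(points, k=2)
most_outlying = lof.index(max(lof))
```

Values close to 1 mean a point is about as dense as its neighbors; values well above 1 mean it sits in a sparser region than they do.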

The other density-based method that outlier detection uses is the local
distance-based outlier factor (`ldof`). LDOF is a ratio of two measures: the
first is the average distance of the data point to its K nearest
neighbors; the second is the average of the pairwise distances of the
neighbors themselves. Again, the higher the value, the more the data point is an
outlier.
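
The ratio described above can be sketched as follows. This is a brute-force illustration assuming Euclidean distance and K of at least 2; the function name `ldof_scores` and the sample points are hypothetical.

```python
import math
from itertools import combinations

def ldof_scores(points, k):
    """Return the ratio described above for each point; requires k >= 2."""
    scores = []
    for i, p in enumerate(points):
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        nearest = ranked[:k]
        # First measure: mean distance from the point to its k nearest neighbors.
        mean_to_neighbors = sum(d for d, _ in nearest) / k
        # Second measure: mean pairwise distance among those neighbors.
        pairs = list(combinations([points[j] for _, j in nearest], 2))
        mean_between = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
        scores.append(mean_to_neighbors / mean_between)
    return scores

# Four points in a tight cluster and one point far away (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = ldof_scores(points, k=3)
most_outlying = scores.index(max(scores))
```

A point that is as far from its neighbors as they are from each other scores about 1, which is what the clustered points get here; the distant point scores much higher.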

As you can see, these four algorithms work differently, so they don’t always agree on which points are outliers. By default, we use all these methods during outlier detection, then normalize and combine their results and give every data point in the index an outlier score. The outlier score ranges from 0 to 1, where a higher score indicates a higher chance that the data point is an outlier compared to the other data points in the index.
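
The text does not spell out how the results are normalized and combined, so the following is only one plausible sketch: it assumes min-max scaling of each method's raw scores followed by a simple average. Both choices, the function names, and the raw score values are assumptions made for illustration.

```python
def min_max(values):
    """Scale values to the range 0 to 1 (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def combine(method_scores):
    """Average the normalized scores of all methods, point by point."""
    normalized = [min_max(scores) for scores in method_scores.values()]
    return [sum(column) / len(column) for column in zip(*normalized)]

# Made-up raw scores for four data points, one list per method.
raw = {
    "lof": [1.0, 1.1, 0.9, 13.0],
    "ldof": [1.0, 0.8, 1.2, 11.6],
    "distance_kth_nn": [1.0, 1.0, 1.0, 13.5],
    "distance_knn": [1.0, 1.0, 1.0, 13.1],
}
combined = combine(raw)
most_outlying = combined.index(max(combined))
```

Normalizing first matters because the raw outputs of the methods are on different scales; without it, the method with the largest numbers would dominate the combined score.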

Outlier detection is a batch analysis; it runs against your data once. If new data comes into the index, you need to run the analysis again on the altered data.

### Feature influence

Besides the outlier score, another value is calculated during outlier detection: the feature influence score. As we mentioned, there are multiple features of a data point that are analyzed during outlier detection. An influential feature is a feature of a data point that is responsible for the point being an outlier. The value of feature influence provides a relative ranking of features by their contribution to a point being an outlier. Therefore, while the outlier score tells us whether a data point is an outlier, feature influence shows which features make the point an outlier. This context helps you understand why the data point is unusual and can drive visualizations.
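
To show how such a ranking might be used, the sketch below sorts the features of one result document by influence. The document shape (an `ml` object holding `outlier_score` and a `feature_influence` list of `feature_name` and `influence` pairs) is an assumption, and the feature names and values are made up; inspect the results in your own destination index for the actual structure.

```python
# Hypothetical result document; field names and values are made up.
result = {
    "ml": {
        "outlier_score": 0.92,
        "feature_influence": [
            {"feature_name": "response_time", "influence": 0.15},
            {"feature_name": "bytes", "influence": 0.80},
            {"feature_name": "request_count", "influence": 0.05},
        ],
    }
}

def rank_features(doc):
    """Return (feature_name, influence) pairs, most influential first."""
    influences = doc["ml"]["feature_influence"]
    ordered = sorted(influences, key=lambda item: item["influence"], reverse=True)
    return [(item["feature_name"], item["influence"]) for item in ordered]

ranked = rank_features(result)
for name, influence in ranked:
    print(f"{name}: {influence:.2f}")
```

In this made-up example, `bytes` ranks first, so it would be the first feature to look at when investigating why the point was flagged.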