This functionality is experimental and may be changed or removed completely in a future release. Elastic will take a best effort approach to fix any issues, but experimental features are not subject to the support SLA of official GA features.
Classification is a machine learning process for predicting the class or category of a given data point in a dataset. Typical examples of classification problems are predicting loan risk, classifying music, or detecting cancer in a DNA sequence. In the first case, for example, our dataset consists of data on loan applicants that covers investment history, employment status, debt status, and so on. Based on historical data, the classification analysis predicts whether it is safe or risky to lend money to a given loan applicant. In the second case, the data we have represents songs and the analysis, based on the features of the data points, classifies the songs as hip-hop, country, classical, or any other genre available in the set of categories we have. Therefore, classification is for predicting discrete, categorical values, unlike regression analysis which predicts continuous, numerical values.
From the perspective of the possible output, there are two types of classification: binary and multi-class classification. In binary classification, the variable you want to predict has only two potential values. The loan example above is a binary classification problem where the two potential outputs are safe or risky. The music classification problem is an example of multi-class classification where there are many different potential outputs; one for every possible music genre. In the 7.6.2 version of the Elastic Stack, you can perform only binary classification analysis.
When you perform classification, you must identify a subset of fields that you want to use to create a model for predicting another field value. We refer to these fields as feature variables and dependent variable, respectively. Feature variables are the values that the dependent variable value depends on. There are three different types of feature variables that you can use with our classification algorithm: numerical, categorical, and boolean. Arrays are not supported in the feature variable fields.
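As a sketch, a single labeled data point with all three feature variable types might look like the following in Python. The field names and values here are purely illustrative, not part of any Elastic Stack API:

```python
# A hypothetical loan-applicant document (all names are illustrative).
applicant = {
    # feature variables
    "employment_status": "employed",   # categorical
    "years_of_history": 7,             # numerical
    "has_defaulted": False,            # boolean
    # dependent variable (the label the model learns to predict)
    "loan_outcome": "safe",
}

# Everything except the dependent variable serves as a feature variable.
feature_variables = {k: v for k, v in applicant.items() if k != "loan_outcome"}
dependent_variable = applicant["loan_outcome"]
```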
Classification – just like regression – is a supervised machine learning process. It means that you need to supply a labeled training dataset that has some feature variables and a dependent variable. The classification algorithm learns the relationships between the features and the dependent variable. Once you’ve trained the model on your training dataset, you can reuse the knowledge that the model has learned about the relationships between the data points to classify new data. Your training dataset should be approximately balanced which means the number of data points belonging to the various classes should not be widely different, otherwise the classification analysis may not provide the best predictions. Read Imbalanced class sizes affect classification performance to learn more.
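One rough way to sanity-check balance before training is to compare class frequencies. The 3x tolerance below is an arbitrary illustrative choice, not an Elastic Stack default:

```python
from collections import Counter

def is_roughly_balanced(labels, tolerance=3.0):
    """Return True if no class is more than `tolerance` times as
    frequent as the rarest class. The 3x threshold is illustrative,
    not an Elastic Stack setting."""
    counts = Counter(labels)
    return max(counts.values()) <= tolerance * min(counts.values())
```

For example, a 60/40 split of safe and risky labels passes this check, while a 95/5 split does not.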
The ensemble algorithm that we use in the Elastic Stack is a type of boosting called a boosted tree regression model, which combines multiple weak models into a composite one. We use decision trees to learn to predict the probability that a data point belongs to a certain class.
The following sections help you understand and interpret the results of a classification analysis.
The value of class_probability shows how likely it is that a given data point belongs to a certain class. It is a value between 0 and 1. The higher the number, the higher the probability that the data point belongs to the named class. This information is stored in the top_classes array for each document in your destination index. See the Viewing classification results section in the classification example.
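To illustrate, the top_classes array of a destination-index document can be read as follows in Python. The field layout here is a simplified assumption; check your own destination index for the exact mapping:

```python
# Simplified, assumed shape of the per-document classification results.
doc = {
    "top_classes": [
        {"class_name": "safe", "class_probability": 0.87},
        {"class_name": "risky", "class_probability": 0.13},
    ]
}

# The class with the highest probability is the predicted class.
predicted = max(doc["top_classes"], key=lambda c: c["class_probability"])
```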
The value of class_score controls the probability at which a class label is assigned to a data point. In the normal case, where you maximize the number of correct labels, a class label is assigned when its predicted probability is greater than 0.5. The class_score makes it possible to change this behavior, so the threshold can be less than or greater than 0.5. For example, suppose our two classes are class 0 and class 1; then the value of class_score is always non-negative and its definition is:

class_score(class 0) = 0.5 / (1.0 - k) * probability(class 0)
class_score(class 1) = 0.5 / k * probability(class 1)
k is a positive constant less than one. It represents the predicted probability of class 1 for a data point at which to label it class 1, and is chosen to maximize the minimum recall of any class. This is useful, for example, in the case of highly imbalanced data. If class 0 is much more frequent in the training data than class 1, then it can mean that you achieve the best accuracy by assigning class 0 to every data point. This is equivalent to zero recall for class 1. Instead of this behavior, the default scheme of the Elastic Stack classification analysis is to choose k < 0.5 and accept a higher rate of class 0 predicted as class 1 errors, or in other words, a slight degradation of the overall accuracy.
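The class_score formulas can be worked through in a few lines of Python. This is an illustrative sketch of the arithmetic, not Elastic Stack code:

```python
def class_scores(prob_class1, k):
    """Apply the class_score formulas: k is the predicted probability
    of class 1 at which a data point is labeled class 1 (0 < k < 1)."""
    prob_class0 = 1.0 - prob_class1
    score_class0 = 0.5 / (1.0 - k) * prob_class0
    score_class1 = 0.5 / k * prob_class1
    return score_class0, score_class1

# With k = 0.5 the scores equal the raw probabilities, so the label
# flips at probability 0.5. With k = 0.2, a data point whose class 1
# probability is only 0.3 is nevertheless labeled class 1:
s0, s1 = class_scores(0.3, k=0.2)   # s0 = 0.4375, s1 = 0.75
```

Lowering k below 0.5 therefore trades some overall accuracy for better recall of the rare class, exactly the behavior described above.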
Feature importance is calculated for supervised machine learning methods such as regression and classification. This value provides further insight into the results of a data frame analytics job and therefore helps interpret these results. As we mentioned, there are multiple features of a data point that are analyzed during data frame analytics. These features are responsible for a particular prediction to varying degrees. Feature importance shows to what degree a given feature of a data point contributes to the prediction. The feature importance value of a feature can be either positive or negative depending on its effect on the prediction. If the feature reduces the prediction value, the value is negative. If the feature increases the prediction, the feature importance value is positive. The magnitude of the feature importance value shows how significantly the feature affects the prediction both locally (for a given data point) and generally (for the whole dataset).
Feature importance in the Elastic Stack is calculated using the SHAP (SHapley Additive exPlanations) method as described in Lundberg, S. M., & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In NeurIPS 2017.
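The sign and magnitude rules can be illustrated with a set of hypothetical per-document feature importance values. The field names and numbers below are made up for illustration, not the output of a real job:

```python
# Hypothetical per-document feature importance values (illustrative).
feature_importance = {
    "employment_status": 1.2,    # positive: pushes the prediction up
    "years_of_history": -0.4,    # negative: pushes the prediction down
    "has_defaulted": -2.1,       # negative, and largest in magnitude
}

# The magnitude, not the sign, tells you which feature mattered most
# for this particular data point.
most_influential = max(feature_importance, key=lambda f: abs(feature_importance[f]))
```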
By default, feature importance values are not calculated. To generate this information, when you create a data frame analytics job you must specify the num_top_feature_importance_values property. The feature importance values are stored in the destination index in fields prefixed by. The number of feature importance values for each document might be less than the num_top_feature_importance_values property value, because only features that had a positive or negative effect on the prediction are returned.
You can measure how well the model has performed on your dataset by using the classification evaluation type of the evaluate data frame analytics API. The metric that the evaluation provides you is the multi-class confusion matrix, which tells you how many times a given data point that belongs to a given class was classified correctly and incorrectly. In other words, it shows how many times a data point that belongs to class X was mistakenly classified as class Y.
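The idea behind the confusion matrix can be sketched in plain Python. The evaluate API computes this server-side; the following only mirrors the concept on a tiny made-up label set:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs: a plain-Python sketch of
    the multi-class confusion matrix the evaluate API reports."""
    return Counter(zip(actual, predicted))

actual    = ["safe", "safe", "safe", "risky", "risky"]
predicted = ["safe", "safe", "risky", "risky", "safe"]
matrix = confusion_matrix(actual, predicted)
# matrix[("safe", "risky")] counts safe applicants that were
# mistakenly classified as risky.
```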
Another crucial measurement is how well your model performs on unseen data points. To assess how well the trained model will perform on data it has never seen before, you must set aside a proportion of the training dataset for testing. This held-out portion of the data is the testing dataset. Once the model has been trained, you can let it predict the value of the data points it has never seen before and compare the predictions to the actual values by using the evaluate data frame analytics API.
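Conceptually, the split works like the sketch below. This is an illustration of the general idea, not how the Elastic Stack implements it; a data frame analytics job controls the split with its training_percent setting instead:

```python
import random

def train_test_split(dataset, test_fraction=0.2, seed=7):
    """Shuffle the labeled data and hold out a fraction for testing.
    Illustrative sketch only; not Elastic Stack code."""
    data = list(dataset)
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    cut = int(len(data) * (1.0 - test_fraction))
    return data[:cut], data[cut:]

# Hold out 20% of 100 labeled data points for evaluation.
training, testing = train_test_split(range(100))
```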