Classification

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

Classification is a machine learning process for predicting the class or category of a given data point in a dataset. Typical examples of classification problems are predicting loan risk, classifying music, or detecting cancer in a DNA sequence. In the first case, for example, our dataset consists of data on loan applicants that covers investment history, employment status, debt status, and so on. Based on this historical data, the classification analysis predicts whether it is safe or risky to lend money to a given loan applicant. In the second case, the data points represent songs, and the analysis classifies each song, based on its features, as hip-hop, country, classical, or any other genre available in our set of categories. Classification is therefore for predicting discrete, categorical values, unlike regression analysis, which predicts continuous, numerical values.

From the perspective of the possible output, there are two types of classification: binary and multi-class classification. In binary classification the variable you want to predict has only two potential values. The loan example above is a binary classification problem where the two potential outputs are safe or risky. The music classification problem is an example of multi-class classification where there are many different potential outputs; one for every possible music genre. In the 7.6.2 version of the Elastic Stack, you can perform only binary classification analysis.

Feature variables

When you perform classification, you must identify a subset of fields that you want to use to create a model for predicting another field value. We refer to these fields as the feature variables and the dependent variable, respectively. Feature variables are the fields whose values the dependent variable depends on. There are three different types of feature variables that you can use with our classification algorithm: numerical, categorical, and boolean. Arrays are not supported in the feature variable fields.

Training the classification model

Classification, just like regression, is a supervised machine learning process. This means that you must supply a labeled training dataset that has some feature variables and a dependent variable. The classification algorithm learns the relationships between the features and the dependent variable. Once you have trained the model on your training dataset, you can reuse the knowledge that the model has learned about the relationships between the data points to classify new data. Your training dataset should be approximately balanced, which means that the number of data points belonging to the various classes should not differ widely; otherwise, the classification analysis may not provide the best predictions. Read Imbalanced class sizes affect classification performance to learn more.
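Before training, it is worth checking how balanced the training labels are. The following sketch is illustrative only; the `class_balance` helper and the 5:1 imbalance threshold are assumptions chosen for the example, not part of the Elastic Stack.

```python
from collections import Counter

def class_balance(labels, max_ratio=5.0):
    """Count labels per class and flag imbalance beyond max_ratio (an arbitrary threshold)."""
    counts = Counter(labels)
    balanced = max(counts.values()) / min(counts.values()) <= max_ratio
    return counts, balanced

# Toy training labels for a binary loan-risk problem.
labels = ["safe"] * 90 + ["risky"] * 10
counts, balanced = class_balance(labels)
print(counts)    # Counter({'safe': 90, 'risky': 10})
print(balanced)  # False: a 9:1 ratio exceeds the 5:1 threshold
```

A result of `False` here suggests resampling or collecting more minority-class examples before training.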

Classification algorithms

The ensemble algorithm that we use in the Elastic Stack is a type of boosting called a boosted tree regression model, which combines multiple weak models into a composite one. We use decision trees to learn to predict the probability that a data point belongs to a certain class.
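The idea of combining weak models can be sketched in a few lines. This is a toy illustration of the general boosted-tree principle, not Elastic's implementation: each "tree" here is a single threshold rule (a decision stump) contributing a log-odds value, and the ensemble sums those contributions before squashing them into a probability. All thresholds and weights are invented for the example.

```python
import math

def stump(threshold, below, above):
    """A weak learner: returns a log-odds contribution based on one threshold."""
    return lambda x: below if x < threshold else above

# The ensemble is a list of weak learners whose outputs are summed.
ensemble = [stump(3.0, -1.0, 1.0), stump(5.0, -0.5, 0.5)]

def predict_probability(x):
    score = sum(tree(x) for tree in ensemble)          # additive log-odds
    return 1.0 / (1.0 + math.exp(-score))              # sigmoid -> P(class 1)

print(predict_probability(6.0))  # high: both stumps vote for class 1
print(predict_probability(1.0))  # low: both stumps vote against class 1
```

Real boosted trees fit each new tree to the residual errors of the ensemble so far; this sketch only shows how the trained weak learners combine at prediction time.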

Interpreting classification results

The following sections help you understand and interpret the results of a classification analysis.

class_probability

The value of class_probability shows how likely it is that a given data point belongs to a certain class. It is a value between 0 and 1. The higher the number, the higher the probability that the data point belongs to the named class. This information is stored in the top_classes array for each document in your destination index. See the Viewing classification results section in the classification example.

class_score

The value of class_score controls the probability at which a class label is assigned to a data point. In the normal case, where you want to maximize the number of correct labels, a class label is assigned when its predicted probability is greater than 0.5. The class_score makes it possible to change this behavior, so the effective threshold can be less than or greater than 0.5. For example, if the two classes are denoted class 0 and class 1, the value of class_score is always non-negative and is defined as:

class_score(class 0) = 0.5 / (1.0 - k) * probability(class 0)
class_score(class 1) = 0.5 / k * probability(class 1)

Here, k is a positive constant less than one. It represents the predicted probability of class 1 at which a data point is labeled class 1, and it is chosen to maximize the minimum recall of any class. This is useful, for example, in the case of highly imbalanced data. If class 0 is much more frequent in the training data than class 1, you may achieve the best accuracy by assigning class 0 to every data point, which is equivalent to zero recall for class 1. To avoid this behavior, the default scheme of the Elastic Stack classification analysis chooses k < 0.5 and accepts a higher rate of errors where the actual class is 0 but the predicted class is 1; in other words, a slight degradation of the overall accuracy.
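The effect of k on the decision is easy to verify numerically. The following sketch implements the two class_score formulas exactly as given above; the function names are our own.

```python
def class_scores(prob_class1, k):
    """Apply the class_score weighting from the definition above."""
    prob_class0 = 1.0 - prob_class1
    score0 = 0.5 / (1.0 - k) * prob_class0
    score1 = 0.5 / k * prob_class1
    return score0, score1

def predicted_class(prob_class1, k):
    score0, score1 = class_scores(prob_class1, k)
    return 1 if score1 > score0 else 0

# With k = 0.5 the scores equal the raw probabilities, so the usual
# 0.5 cutoff applies; with k = 0.2 a 30% probability suffices for class 1.
print(predicted_class(0.3, 0.5))  # 0
print(predicted_class(0.3, 0.2))  # 1
```

This shows concretely how choosing k < 0.5 trades some class 0 accuracy for better class 1 recall.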

Feature importance

Feature importance is calculated for supervised machine learning methods such as regression and classification. This value provides further insight into the results of a data frame analytics job and therefore helps interpret these results. As we mentioned, there are multiple features of a data point that are analyzed during data frame analytics. These features are responsible for a particular prediction to varying degrees. Feature importance shows to what degree a given feature of a data point contributes to the prediction. The feature importance value of a feature can be either positive or negative depending on its effect on the prediction. If the feature reduces the prediction value, the feature importance value is negative. If the feature increases the prediction value, the feature importance value is positive. The magnitude of the feature importance value shows how significantly the feature affects the prediction, both locally (for a given data point) and globally (for the whole dataset).

Feature importance in the Elastic Stack is calculated using the SHAP (SHapley Additive exPlanations) method as described in Lundberg, S. M., & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In NeurIPS 2017.

By default, feature importance values are not calculated. To generate this information, you must specify the num_top_feature_importance_values property when you create a data frame analytics job. The feature importance values are stored in the destination index in fields prefixed by ml.feature_importance.

The number of feature importance values for each document might be less than the num_top_feature_importance_values property value. This is because the analysis returns only the features that had a positive or negative effect on the prediction.
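Documents in the destination index can therefore carry a variable number of ml.feature_importance fields. The sketch below assumes a hypothetical flattened document with invented feature names and values, and ranks the features by the magnitude of their importance.

```python
# Hypothetical flattened destination-index document; feature names are invented.
doc = {
    "ml.prediction": "risky",
    "ml.feature_importance.employment_status": 0.35,
    "ml.feature_importance.investment_history": -0.12,
}

prefix = "ml.feature_importance."
importance = {
    key[len(prefix):]: value
    for key, value in doc.items()
    if key.startswith(prefix)
}

# Rank by magnitude: the sign shows direction, the magnitude shows strength.
ranked = sorted(importance.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked)  # [('employment_status', 0.35), ('investment_history', -0.12)]
```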

Measuring model performance

You can measure how well the model has performed on your dataset by using the classification evaluation type of the evaluate data frame analytics API. The metric that the evaluation provides is the multi-class confusion matrix, which tells you how many times a given data point that belongs to a given class was classified correctly and incorrectly. In other words, it shows how many times a data point that belongs to class X was mistakenly classified as class Y.
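The structure of a confusion matrix is simple to reproduce locally. This sketch, under our own toy labels, counts (actual, predicted) pairs the same way a multi-class confusion matrix does; it is not the evaluate data frame analytics API itself.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a multi-class problem."""
    return Counter(zip(actual, predicted))

actual    = ["safe", "safe", "risky", "risky", "safe"]
predicted = ["safe", "risky", "risky", "safe", "safe"]

matrix = confusion_matrix(actual, predicted)
print(matrix[("safe", "safe")])   # 2 correct "safe" predictions
print(matrix[("safe", "risky")])  # 1 "safe" point mislabeled as "risky"
```

Diagonal entries (X, X) count correct classifications; off-diagonal entries (X, Y) count points of class X mislabeled as Y.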

Another crucial measurement is how well your model performs on unseen data points. To assess how well the trained model will perform on data it has never seen before, you must set aside a proportion of the training dataset for testing. This held-out portion is the testing dataset. Once the model has been trained, you can let the model predict the value of the data points it has never seen before and compare the prediction to the actual value by using the evaluate data frame analytics API.
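The hold-out idea can be sketched as a shuffled split. This is a generic illustration with an assumed 80/20 split and a fixed seed for reproducibility; Elastic's own training_percent mechanism handles this split inside the job configuration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    cut = int(len(rows) * (1.0 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

The model is then trained only on `train`, and its predictions on `test` are compared against the known labels.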