Regression analysis is a machine learning process for estimating the relationships among different fields in your data, then making further predictions based on these relationships.
For example, suppose we are interested in finding the relationship between apartment size and monthly rent in a city. Our imaginary data set consists of three data points:
After the model determines the relationship between the apartment size and the rent, it can make predictions such as the monthly rent of a 100-square-meter apartment.
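As a rough illustration of what "determining the relationship" can mean in the simplest case, the sketch below fits a straight line with ordinary least squares. The three data points are hypothetical values invented for this example (they are not the data set referred to above), and a single straight line is far simpler than the model the Elastic Stack actually trains.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data points: apartment size (square meters) and monthly rent.
sizes = [40.0, 60.0, 80.0]
rents = [800.0, 1100.0, 1400.0]

slope, intercept = fit_line(sizes, rents)

# Predict the monthly rent of a 100-square-meter apartment.
predicted_rent = slope * 100.0 + intercept
print(predicted_rent)
```

With these made-up numbers the fitted line has a slope of 15 and an intercept of 200, so the predicted rent for 100 square meters is 1700.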
This is a simple example. Usually regression problems are multi-dimensional, so the relationships that regression analysis tries to find are between multiple fields. To extend our example, a more complex regression analysis could take into account additional factors such as the location of the apartment in the city, the floor it is on, and whether it has a riverside view. All of these factors can be considered features; they are measurable properties or characteristics of the phenomenon we’re studying.
When you perform regression analysis, you must identify a subset of fields that you want to use to create a model for predicting other fields. We refer to these fields as feature variables and dependent variables, respectively. Feature variables are the values that the dependent variable value depends on. If one or more of the feature variables changes, the dependent variable value also changes. There are three different types of feature variables that you can use with our regression algorithm:
- Numerical. In our example, the size of the apartment was a numerical feature variable.
- Categorical. A variable that can have one value from a set of values. The value set has a fixed and limited number of possible items. In the example, the location of the apartment in the city (borough) is a categorical variable.
- Boolean. The riverside view in the example is a boolean value because an apartment either has a riverside view or doesn’t have one.

Arrays are not supported.
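To make the three feature types concrete, here is a minimal sketch of one common way to turn an apartment record into a numeric feature vector: the numerical value is used as is, the categorical value is one-hot encoded, and the boolean becomes 0 or 1. The borough names and the `encode_apartment` helper are hypothetical, and this is not a description of how the Elastic Stack encodes features internally.

```python
# Hypothetical fixed set of boroughs: a categorical variable has a limited value set.
BOROUGHS = ["north", "center", "south"]


def encode_apartment(size_m2, borough, riverside_view):
    """Encode one apartment as a flat list of floats."""
    if borough not in BOROUGHS:
        raise ValueError(f"unknown borough: {borough!r}")
    # Numerical feature: used directly.
    features = [float(size_m2)]
    # Categorical feature: one slot per possible value, exactly one of them set to 1.
    features += [1.0 if borough == known else 0.0 for known in BOROUGHS]
    # Boolean feature: 1.0 for True, 0.0 for False.
    features.append(1.0 if riverside_view else 0.0)
    return features


encoded = encode_apartment(75, "center", True)
print(encoded)
```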
Regression is a supervised machine learning method, which means that you need to supply a labeled training data set that has some feature variables and a dependent variable. The regression algorithm identifies the relationships between the feature variables and the dependent variable. Once you’ve trained the model on your training data set, you can reuse the knowledge that the model has learned to make inferences about new data.
The relationships between the feature variables and the dependent variable are described as a mathematical function. Regression analysis tries to find the best prediction for the dependent variable by combining the predictions from multiple base learners – algorithms that generalize from the data set. The performance of an ensemble is usually better than the performance of each individual base learner because the individual learners will make different errors. These average out when their predictions are combined.
When you create a data frame analytics job, the inference step of the process might fail if the model is too large to fit into the JVM. For a workaround, refer to this GitHub issue.
The ensemble learning technique that we use in the Elastic Stack is a type of boosting called extreme gradient boosting (XGBoost), which combines decision trees with gradient boosting methodologies.
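The idea of boosting decision trees can be sketched in a few lines. The example below is plain gradient boosting for squared error, using one-split trees (stumps) as base learners: each new stump is fitted to the residuals left by the ensemble so far. It is a teaching sketch only, with hypothetical data, function names, and settings, and it omits the regularization and other refinements of XGBoost-style implementations.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes the squared error."""
    best = None
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical data: apartment size (square meters) and monthly rent.
xs = [30.0, 45.0, 60.0, 75.0, 90.0, 120.0]
ys = [700.0, 900.0, 1150.0, 1300.0, 1600.0, 2100.0]

baseline_error = training_mse(fit_boosted(xs, ys, n_trees=0), xs, ys)
one_tree_error = training_mse(fit_boosted(xs, ys, n_trees=1), xs, ys)
boosted_error = training_mse(fit_boosted(xs, ys, n_trees=50), xs, ys)
print(baseline_error, one_tree_error, boosted_error)
```

Each stump on its own is a weak predictor, but the training error shrinks as more of them are combined.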
A loss function measures how well a given machine learning model fits the specific data set. It boils down all the different under- and overestimations of the model to a single number, known as the prediction error. The bigger the difference between the prediction and the ground truth, the higher the value of the loss function. Loss functions are used automatically in the background during hyperparameter optimization and when training the decision trees to compare the performance of various iterations of the model.
In the Elastic Stack, there are three different types of loss function:
- mean squared error (mse): It is the default choice when no additional information about the data set is available.
- mean squared logarithmic error (msle; a variation of mse): It is for cases where the target values are all positive with a long tail distribution (for example, prices or population).
- Pseudo-Huber loss (huber): Use it when you want to prevent the model trying to fit the outliers instead of regular data.
The various types of loss function calculate the prediction error differently. The appropriate loss function for your use case depends on the target distribution in your data set, the problem that you want to model, the number of outliers in the data, and so on.
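The difference between the three losses is easiest to see side by side. The sketch below uses the textbook definitions; the `offset` and `delta` parameter names and their default values of 1.0 are assumptions made for this example, and the exact formulas used inside the Elastic Stack may differ in detail.

```python
import math


def mse(actual, predicted):
    """Mean squared error: large errors are penalized quadratically."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def msle(actual, predicted, offset=1.0):
    """Mean squared logarithmic error: compares values on a log scale.

    Requires actual + offset and predicted + offset to be positive.
    """
    return sum((math.log(a + offset) - math.log(p + offset)) ** 2
               for a, p in zip(actual, predicted)) / len(actual)


def pseudo_huber(actual, predicted, delta=1.0):
    """Pseudo-Huber loss: roughly quadratic for small errors, linear for large ones."""
    return sum(delta ** 2 * (math.sqrt(1.0 + ((a - p) / delta) ** 2) - 1.0)
               for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical rents; the last data point is an outlier that the model misses badly.
actual = [1000.0, 1200.0, 1500.0, 9000.0]
predicted = [1100.0, 1150.0, 1450.0, 2000.0]

for name, loss in (("mse", mse), ("msle", msle), ("huber", pseudo_huber)):
    print(name, loss(actual, predicted))
```

With the outlier present, the mean squared error is dominated by that single point, while the other two losses grow far more slowly with the size of the miss.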
You can specify the loss function to be used during regression analysis when you create the data frame analytics job. The default is mean squared error (mse). If you choose msle or huber, you can also set up a parameter for the loss function. With the parameter, you can further refine the behavior of the chosen function.
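As a hedged sketch, the request body for such a job might be assembled as shown below. The `loss_function` and `loss_function_parameter` field names reflect our understanding of the create data frame analytics jobs API and should be checked against the API reference for your version; the index names, the `rent` field, and the `build_regression_job_body` helper are hypothetical.

```python
import json


def build_regression_job_body(source_index, dest_index, dependent_variable,
                              loss_function="mse", loss_function_parameter=None):
    """Build a request body for a regression data frame analytics job (sketch)."""
    if loss_function not in ("mse", "msle", "huber"):
        raise ValueError(f"unsupported loss function: {loss_function!r}")
    regression = {
        "dependent_variable": dependent_variable,
        "loss_function": loss_function,
    }
    # Only send the parameter when it was set explicitly, so that the
    # default value applies otherwise.
    if loss_function_parameter is not None:
        regression["loss_function_parameter"] = loss_function_parameter
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "analysis": {"regression": regression},
    }


# Hypothetical index and field names.
body = build_regression_job_body(
    "apartments", "apartments-rent-predictions", "rent",
    loss_function="huber", loss_function_parameter=1.0,
)
print(json.dumps(body, indent=2))
```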
Consult the Jupyter notebook on regression loss functions to learn more.
The default loss function parameter values work well in most cases. It is highly recommended to use the default values unless you fully understand the impact of the different loss function parameters.
The model that you created is stored as Elasticsearch documents in internal indices. In other words, the characteristics of your trained model are saved and ready to be deployed and used as functions. The inference feature enables you to use your model in a preprocessor of an ingest pipeline or in a pipeline aggregation of a search query to make predictions about your data.
Feature importance provides further information about the results of an analysis and helps to interpret the results in a more subtle way. To learn more, see the feature importance documentation.
You can measure how well the model has performed on your training data set by using the regression evaluation type of the evaluate data frame analytics API. The mean squared error (MSE) value that the evaluation provides you on the training data set is the training error. Training the regression model means finding the combination of model parameters that produces the lowest possible training error.
Another crucial measurement is how well your model performs on unseen data points. To assess how well the trained model will perform on data it has never seen before, you must set aside a proportion of the training data set for testing. This split of the data set is the testing data set. Once the model has been trained, you can let the model predict the value of the data points it has never seen before and compare the prediction to the actual value. This test provides an estimate of a quantity known as the model generalization error.
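The holdout procedure described above can be sketched as follows. The data, the 30 percent test fraction, the random seed, and the simple straight-line model are all assumptions chosen for illustration; they stand in for whatever data set and model you actually use.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and set aside a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(slope, intercept, rows):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


# Hypothetical data: (size in square meters, monthly rent) with some noise.
noise = [40, -55, 20, -10, 65, -35, 5, -60, 30, -15]
rows = [(float(size), 15.0 * size + 200.0 + n)
        for size, n in zip(range(30, 130, 10), noise)]

train_rows, test_rows = train_test_split(rows)
slope, intercept = fit_line(train_rows)

# Error on the data the model was fitted to.
training_error = mean_squared_error(slope, intercept, train_rows)
# Error on held-out data: an estimate of the generalization error.
generalization_estimate = mean_squared_error(slope, intercept, test_rows)
print(training_error, generalization_estimate)
```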
Two concepts describe how well the regression algorithm was able to learn the relationship between the feature variables and the dependent variable. Underfitting is when the model cannot capture the complexity of the data set. Overfitting is when the model is too specific to the training data set and is capturing details which do not generalize to new data. A model that overfits the data has a low MSE value on the training data set and a high MSE value on the testing data set. For more information about the evaluation metrics, see Regression evaluation.
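The two failure modes can be illustrated by comparing training and testing MSE for three deliberately simple models: a constant that always predicts the mean (prone to underfitting), a straight line, and a "memorizer" that returns the rent of the nearest training point (prone to overfitting). All data and model choices here are hypothetical and serve only to show the pattern.

```python
def mse(model, rows):
    return sum((y - model(x)) ** 2 for x, y in rows) / len(rows)


def fit_mean(rows):
    """Ignores the feature entirely: too simple to capture the relationship."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Ordinary least squares for a single feature."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(rows):
    """Returns the target of the nearest training point: fits the noise exactly."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]


# Hypothetical noisy data: (size in square meters, monthly rent).
train_rows = [(30.0, 710.0), (50.0, 890.0), (70.0, 1310.0), (90.0, 1490.0), (110.0, 1910.0)]
test_rows = [(35.0, 685.0), (55.0, 1065.0), (75.0, 1285.0), (95.0, 1665.0)]

results = {}
for name, fit in (("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)):
    model = fit(train_rows)
    results[name] = (mse(model, train_rows), mse(model, test_rows))
    print(name, results[name])
```

The memorizer reaches a training MSE of zero but does much worse than the line on the testing data, which is the signature of overfitting. The mean predictor has a high MSE on both sets, which is the signature of underfitting.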