## Regression

This functionality is in technical preview and may be changed or removed in a future release. Elastic will apply best effort to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

Regression analysis is a machine learning process for estimating the relationships among different fields in your data, then making further predictions based on these relationships.

For example, suppose we are interested in finding the relationship between apartment size and monthly rent in a city. Our imaginary data set consists of three data points:

| Size (m²) | Monthly rent |
|-----------|--------------|
| 44        | 1600         |
| 24        | 1055         |
| 63        | 2300         |

After the model determines the relationship between apartment size and rent, it can make predictions, such as the monthly rent of a 100 m² apartment.
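To make the idea concrete, the following minimal sketch fits a straight line to the three data points with ordinary least squares and uses it to predict the rent of a 100 m² apartment. This is only an illustration of what "finding the relationship" means. It is not the algorithm that the Elastic Stack uses, and the function names are hypothetical.

```python
# Toy data from the example above: apartment size (m²) and monthly rent.
sizes = [44.0, 24.0, 63.0]
rents = [1600.0, 1055.0, 2300.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, rents)


def predict_rent(size_m2):
    """Predict the monthly rent for an apartment of the given size."""
    return intercept + slope * size_m2


predicted_100 = predict_rent(100.0)
print(f"Predicted rent for 100 m²: {predicted_100:.0f}")
```

With only three points and one feature, a straight line is enough. Real data sets usually need a model that can capture non-linear relationships across many fields.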

This is a simple example. Regression problems are usually multi-dimensional, so the relationships that regression analysis tries to find are between multiple fields. To extend our example, a more complex regression analysis could take into account additional factors such as the location of the apartment in the city, the floor it is on, and whether it has a riverside view. All of these factors can be considered features; they are measurable properties or characteristics of the phenomenon we’re studying.

### Feature variables

When you perform regression analysis, you must identify a subset of fields that you want to use to create a model for predicting other fields. We refer to these fields as feature variables and dependent variables, respectively. Feature variables are the values that the dependent variable depends on. If one or more of the feature variables change, the dependent variable also changes. There are three different types of feature variables that you can use with our regression algorithm:

• Numerical. In our example, the size of the apartment was a numerical feature variable.
• Categorical. A variable that can have one value from a set of values. The value set has a fixed and limited number of possible items. In the example, the location of the apartment in the city (borough) is a categorical variable.
• Boolean. The riverside view in the example is a boolean value because an apartment either has a riverside view or doesn’t have one.

Arrays are not supported.
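As a rough illustration of the three feature types, the sketch below classifies the values of one hypothetical apartment record and rejects array values. The field names and the `feature_type` helper are assumptions made for this example; they do not reflect how Elasticsearch derives types from index mappings.

```python
def feature_type(value):
    """Classify a feature value as 'boolean', 'numerical', or 'categorical'."""
    # bool is a subclass of int in Python, so it must be checked first.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numerical"
    if isinstance(value, str):
        return "categorical"
    # Lists, tuples, and other containers stand in for unsupported arrays.
    raise TypeError(f"unsupported feature value: {value!r}")


# Hypothetical record with one feature of each supported type.
apartment = {"size_m2": 44, "borough": "old-town", "riverside_view": True}
feature_types = {name: feature_type(value) for name, value in apartment.items()}
print(feature_types)
```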

### Training the regression model

Regression is a supervised machine learning method, which means that you need to supply a labeled training data set that has some feature variables and a dependent variable. The regression algorithm identifies the relationships between the feature variables and the dependent variable. Once you’ve trained the model on your training data set, you can reuse the knowledge that the model has learned to make inferences about new data.
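As a sketch of what supplying a labeled data set looks like in practice, the snippet below builds a request body for the create data frame analytics jobs API. The job ID, index names, and the `monthly_rent` field are placeholders invented for this example, and because the feature is in technical preview, the available options may differ between releases. The snippet only prints the request; it does not contact a cluster.

```python
import json

# Hypothetical names: the job ID, indices, and field are placeholders.
job_id = "apartment-rent-regression"

job_config = {
    # Index that holds the labeled training documents.
    "source": {"index": "apartments"},
    # Index where the results are written.
    "dest": {"index": "apartments-rent-predictions"},
    "analysis": {
        "regression": {
            # The field that the model learns to predict.
            "dependent_variable": "monthly_rent",
            # Share of eligible documents used for training.
            "training_percent": 80,
        }
    },
}

request_line = f"PUT _ml/data_frame/analytics/{job_id}"
print(request_line)
print(json.dumps(job_config, indent=2))
```

Fields in the source index other than the dependent variable are candidates for feature variables.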

The relationships between the feature variables and the dependent variable are described as a mathematical function. Regression analysis tries to find the best prediction for the dependent variable by combining the predictions from multiple base learners, which are algorithms that generalize from the data set. The performance of an ensemble is usually better than the performance of each individual base learner because the individual learners make different errors, and these errors tend to average out when their predictions are combined.
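The averaging effect can be illustrated with a small simulation. In this sketch, each hypothetical base learner predicts a made-up "true" rent function with its own fixed error, and the ensemble simply averages their predictions. Plain averaging is the simplest way to combine learners and is used here only to show why differing errors cancel. The true function, the noise model, and all names are assumptions for illustration.

```python
import random

random.seed(0)


def true_rent(size_m2):
    # Hypothetical "ground truth", used only for this illustration.
    return 250.0 + 32.0 * size_m2


def make_noisy_learner(noise_scale):
    """Return a learner whose predictions carry its own fixed error."""
    bias = random.gauss(0.0, noise_scale)
    tilt = random.gauss(0.0, noise_scale / 50.0)
    return lambda size_m2: true_rent(size_m2) + bias + tilt * size_m2


learners = [make_noisy_learner(100.0) for _ in range(25)]
sizes = [20.0 + 5.0 * i for i in range(17)]  # 20 m² to 100 m²


def mse(predict):
    """Mean squared error of a predictor against the true function."""
    return sum((predict(s) - true_rent(s)) ** 2 for s in sizes) / len(sizes)


def ensemble(size_m2):
    """Combine the base learners by averaging their predictions."""
    return sum(learner(size_m2) for learner in learners) / len(learners)


individual_mses = [mse(learner) for learner in learners]
average_individual_mse = sum(individual_mses) / len(individual_mses)
ensemble_mse = mse(ensemble)

print(f"Average error of a single learner: {average_individual_mse:.1f}")
print(f"Error of the averaged ensemble:    {ensemble_mse:.1f}")
```

For squared error, the averaged prediction is never worse than the average error of the individual learners, and it is strictly better whenever the learners disagree.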

Regression works as a batch analysis. If new data comes into your index, you must restart the data frame analytics job.

#### Regression algorithms

The ensemble learning technique that we use in the Elastic Stack is a type of boosting called extreme gradient boosting (XGBoost), which combines decision trees with gradient boosting methodologies.
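The core idea of gradient boosting with decision trees can be sketched in a few lines. The toy version below uses single-split trees (stumps) on one feature: each round fits a stump to the current residuals, which are the negative gradient of the squared-error loss, and adds a damped copy of it to the ensemble. It omits regularization and the other refinements of a production implementation, and it is not the Elastic Stack's code. The function names, the number of rounds, and the learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single-split tree (stump) that best fits the residuals."""
    best = None
    # Dropping the largest value keeps the right-hand side non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit an additive ensemble of stumps by boosting on residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def training_mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# The three data points from the apartment example.
sizes = [44.0, 24.0, 63.0]
rents = [1600.0, 1055.0, 2300.0]

mean_rent = sum(rents) / len(rents)
baseline_mse = training_mse(lambda x: mean_rent, sizes, rents)
model = fit_boosted_stumps(sizes, rents)
boosted_mse = training_mse(model, sizes, rents)

print(f"Training MSE, mean only:      {baseline_mse:.1f}")
print(f"Training MSE, boosted stumps: {boosted_mse:.1f}")
```

Unlike plain averaging, each new tree depends on the errors of the trees before it, which is what distinguishes boosting from other ensemble methods.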

### Feature importance

You can measure how well the model has performed on your training data set by using the `regression` evaluation type of the evaluate data frame analytics API. The mean squared error (MSE) that the evaluation reports for the training data set is the training error. Training the regression model means finding the combination of model parameters that produces the lowest possible training error.
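For reference, the MSE is the average of the squared differences between the actual and predicted values. The sketch below computes it locally for the three apartments. The predicted rents are made-up numbers standing in for model output, and the helper function is hypothetical rather than part of any Elastic API.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual_rents = [1600.0, 1055.0, 2300.0]
predicted_rents = [1650.0, 1000.0, 2250.0]  # hypothetical model output

training_mse = mean_squared_error(actual_rents, predicted_rents)
print(training_mse)  # 2675.0
```

Because the errors are squared, a few large mistakes raise the MSE more than many small ones.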