Create a data frame analytics job

Generally available; Added in 7.3.0

PUT /_ml/data_frame/analytics/{id}

This API creates a data frame analytics job that performs an analysis on the source indices and stores the outcome in a destination index. By default, the query used in the source configuration is {"match_all": {}}.

If the destination index does not exist, it is created automatically when you start the job.

If you supply only a subset of the regression or classification parameters, hyperparameter optimization occurs. It determines a value for each of the undefined parameters.
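
As a rough illustration of the request shape described above, the following Python sketch assembles the method, path, and JSON body for a regression job. The job ID, index names, and field name are hypothetical placeholders, and the sketch only builds and prints the request; it does not contact a cluster.

```python
import json


def build_create_job_request(job_id, source_index, dest_index, dependent_variable):
    """Return (method, path, body) for a create data frame analytics job call."""
    path = f"/_ml/data_frame/analytics/{job_id}"
    body = {
        # source.query is omitted, so the default {"match_all": {}} applies.
        "source": {"index": source_index},
        # If this index does not exist, it is created when the job starts.
        "dest": {"index": dest_index},
        # Only dependent_variable is set; the other regression parameters are
        # left undefined so that hyperparameter optimization chooses them.
        "analysis": {"regression": {"dependent_variable": dependent_variable}},
    }
    return "PUT", path, body


method, path, body = build_create_job_request(
    "house-price-regression",  # hypothetical job ID
    "houses",                  # hypothetical source index
    "houses-predictions",      # hypothetical destination index
    "price",                   # hypothetical field to predict
)
print(method, path)
print(json.dumps(body, indent=2))
```

The printed body can be sent with any HTTP client by a user that holds the privileges listed under "Required authorization".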

Required authorization

  • Index privileges: create_index, index, manage, read, view_index_metadata
  • Cluster privileges: manage_ml

Path parameters

  • id string Required

    Identifier for the data frame analytics job. This identifier can contain lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters.
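
The identifier rules above can be checked on the client side before sending the request. This is a minimal sketch: the function name is hypothetical, it assumes that a single alphanumeric character is acceptable, and it does not enforce any length limit the server may apply.

```python
import re

# Lowercase alphanumerics, hyphens, and underscores; the first and last
# characters must be alphanumeric.
JOB_ID_PATTERN = re.compile(r"[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?")


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id satisfies the documented character rules."""
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


for candidate in ["loan-default_2024", "-starts-with-hyphen", "HasUpperCase"]:
    print(candidate, is_valid_job_id(candidate))
```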

application/json

Body Required

  • allow_lazy_start boolean

    Specifies whether this job can start when there is insufficient machine learning node capacity for it to be immediately assigned to a node. If set to false and a machine learning node with capacity to run the job cannot be immediately found, the API returns an error. If set to true, the API does not return an error; the job waits in the starting state until sufficient machine learning node capacity is available. This behavior is also affected by the cluster-wide xpack.ml.max_lazy_ml_nodes setting.

    Default value is false.

  • analysis object Required

    The analysis configuration, which contains the information necessary to perform one of the following types of analysis: classification, outlier detection, or regression.

    • classification object

      The configuration information necessary to perform classification.

      • alpha number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.

      • dependent_variable string Required

        Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (integer, short, long, byte), categorical (ip or keyword), or boolean. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric.

      • downsample_factor number

        Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.

      • early_stopping_enabled boolean

        Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.

        Default value is true.

      • eta number

        Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.

      • eta_growth_rate_per_tree number

        Advanced configuration option. Specifies the rate at which eta increases for each new tree that is added to the forest. For example, a rate of 1.05 increases eta by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2.

      • feature_bag_fraction number

        Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.

      • feature_processors array[object]

        Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple feature_processors entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.

      • gamma number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • lambda number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • max_optimization_rounds_per_hyperparameter number

        Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.

      • max_trees number

        Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.

      • num_top_feature_importance_values number

        Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.

        Default value is 0.

      • prediction_field_name string

        Defines the name of the prediction field in the results. Defaults to <dependent_variable>_prediction.

      • randomize_seed number

        Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as source and analyzed_fields are the same).

      • soft_tree_depth_limit number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the soft_tree_depth_tolerance to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.

      • soft_tree_depth_tolerance number

        Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds soft_tree_depth_limit. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01.

      • training_percent
      • class_assignment_objective string
      • num_top_classes number

        Defines the number of categories for which the predicted probabilities are reported. It must be non-negative or -1. If it is -1 or greater than the total number of categories, probabilities are reported for all categories; if you have a large number of categories, there could be a significant effect on the size of your destination index. NOTE: To use the AUC ROC evaluation method, num_top_classes must be set to -1 or a value greater than or equal to the total number of categories.

        Default value is 2.
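
The num_top_classes rules can be restated as two small functions. This is a sketch of the documented behavior only, not of how Elasticsearch implements it; the function names are hypothetical.

```python
def reported_class_count(num_top_classes: int, total_classes: int) -> int:
    """Number of categories for which predicted probabilities are reported."""
    if num_top_classes < -1:
        raise ValueError("num_top_classes must be non-negative or -1")
    # -1, or a value above the number of categories, means "all categories".
    if num_top_classes == -1 or num_top_classes > total_classes:
        return total_classes
    return num_top_classes


def supports_auc_roc(num_top_classes: int, total_classes: int) -> bool:
    """AUC ROC evaluation needs probabilities for every category."""
    return num_top_classes == -1 or num_top_classes >= total_classes


print(reported_class_count(2, 5), supports_auc_roc(2, 5))
print(reported_class_count(-1, 5), supports_auc_roc(-1, 5))
```

With many categories, a setting of -1 multiplies the number of probability entries stored per document, which is the destination index size effect mentioned above.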

    • outlier_detection object

      The configuration information necessary to perform outlier detection. NOTE: Advanced parameters are for fine-tuning classification analysis. They are set automatically by hyperparameter optimization to give the minimum validation error. It is highly recommended to use the default values unless you fully understand the function of these parameters.

      • compute_feature_influence boolean

        Specifies whether the feature influence calculation is enabled.

        Default value is true.

      • feature_influence_threshold number

        The minimum outlier score that a document needs to have in order to calculate its feature influence score. Value range: 0-1.

        Default value is 0.1.

      • method string

        The method that outlier detection uses. Available methods are lof, ldof, distance_kth_nn, distance_knn, and ensemble. The default value is ensemble, which means that outlier detection uses an ensemble of different methods and normalises and combines their individual outlier scores to obtain the overall outlier score.

        Default value is ensemble.

      • n_neighbors number

        Defines the value for how many nearest neighbors each method of outlier detection uses to calculate its outlier score. When the value is not set, different values are used for different ensemble members. This default behavior helps improve the diversity in the ensemble; only override it if you are confident that the value you choose is appropriate for the data set.

      • outlier_fraction number

        The proportion of the data set that is assumed to be outlying prior to outlier detection. For example, 0.05 means it is assumed that 5% of values are real outliers and 95% are inliers.

      • standardization_enabled boolean

        If true, the following operation is performed on the columns before computing outlier scores: (x_i - mean(x_i)) / sd(x_i).

        Default value is true.
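
The standardization formula for standardization_enabled, (x_i - mean(x_i)) / sd(x_i), can be sketched for a single column as follows. Two points are assumptions rather than documented behavior: the sketch uses the population standard deviation, and it rejects constant columns instead of guessing how they are handled.

```python
from statistics import fmean, pstdev


def standardize_column(values):
    """Apply (x - mean) / sd to every value in one column."""
    mean = fmean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        # Assumption: the handling of constant columns is not documented here.
        raise ValueError("cannot standardize a column with zero variance")
    return [(x - mean) / sd for x in values]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize_column(column))
```

After this step every column has mean 0 and unit spread, so features measured on large scales do not dominate distance-based outlier scores.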

    • regression object

      The configuration information necessary to perform regression. NOTE: Advanced parameters are for fine-tuning regression analysis. They are set automatically by hyperparameter optimization to give the minimum validation error. It is highly recommended to use the default values unless you fully understand the function of these parameters.

      • alpha number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.

      • dependent_variable string Required

        Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (integer, short, long, byte), categorical (ip or keyword), or boolean. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric.

      • downsample_factor number

        Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.

      • early_stopping_enabled boolean

        Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.

        Default value is true.

      • eta number

        Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.

      • eta_growth_rate_per_tree number

        Advanced configuration option. Specifies the rate at which eta increases for each new tree that is added to the forest. For example, a rate of 1.05 increases eta by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2.

      • feature_bag_fraction number

        Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.

      • feature_processors array[object]

        Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple feature_processors entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.

      • gamma number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • lambda number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • max_optimization_rounds_per_hyperparameter number

        Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.

      • max_trees number

        Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.

      • num_top_feature_importance_values number

        Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.

        Default value is 0.

      • prediction_field_name string

        Defines the name of the prediction field in the results. Defaults to <dependent_variable>_prediction.

      • randomize_seed number

        Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as source and analyzed_fields are the same).

      • soft_tree_depth_limit number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the soft_tree_depth_tolerance to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.

      • soft_tree_depth_tolerance number

        Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds soft_tree_depth_limit. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01.

      • training_percent
      • loss_function string

        The loss function used during regression. Available options are mse (mean squared error), msle (mean squared logarithmic error), and huber (Pseudo-Huber loss).

        Default value is mse.

      • loss_function_parameter number

        A positive number that is used as a parameter to the loss_function.
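
For orientation, the three loss_function options correspond to well-known formulas. The sketch below uses the textbook definitions. How Elasticsearch computes them internally, and how loss_function_parameter maps onto the `offset` and `delta` arguments used here, are assumptions.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def msle(actual, predicted, offset=1.0):
    """Mean squared logarithmic error; offset keeps the logarithm defined at 0."""
    return sum(
        (math.log(a + offset) - math.log(p + offset)) ** 2
        for a, p in zip(actual, predicted)
    ) / len(actual)


def pseudo_huber(actual, predicted, delta=1.0):
    """Pseudo-Huber loss: quadratic near zero, close to linear for large errors."""
    return sum(
        delta ** 2 * (math.sqrt(1.0 + ((a - p) / delta) ** 2) - 1.0)
        for a, p in zip(actual, predicted)
    ) / len(actual)


actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(mse(actual, predicted), msle(actual, predicted), pseudo_huber(actual, predicted))
```

As a rule of thumb, msle suits targets that span several orders of magnitude, and the Pseudo-Huber loss is less sensitive to a few large residuals than mse.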

  • analyzed_fields object

    Specifies includes and/or excludes patterns to select which fields are included in the analysis. The patterns specified in excludes are applied last, so excludes takes precedence: if the same field is specified in both includes and excludes, the field is not included in the analysis. If analyzed_fields is not set, only the relevant fields are included, for example, all the numeric fields for outlier detection. The supported fields vary for each type of analysis.

    Outlier detection requires numeric or boolean data to analyze. The algorithms don’t support missing values, so fields that have data types other than numeric or boolean are ignored. Documents where included fields contain missing values, null values, or an array are also ignored. Therefore, the dest index may contain documents that don’t have an outlier score.

    Regression supports fields that are numeric, boolean, text, keyword, and ip data types. It is also tolerant of missing values. Fields that are supported are included in the analysis; other fields are ignored. Documents where included fields contain an array with two or more values are also ignored. Documents in the dest index that don’t contain a results field are not included in the regression analysis.

    Classification supports fields that are numeric, boolean, text, keyword, and ip data types. It is also tolerant of missing values. Fields that are supported are included in the analysis; other fields are ignored. Documents where included fields contain an array with two or more values are also ignored. Documents in the dest index that don’t contain a results field are not included in the classification analysis. Classification analysis can be improved by mapping ordinal variable values to a single number. For example, in the case of age ranges, you can model the values as 0-14 = 0, 15-24 = 1, 25-34 = 2, and so on.

    • includes array[string]

      An array of strings that defines the fields that will be included in the analysis.

    • excludes array[string]

      An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes; these fields are excluded from the analysis automatically.
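
The precedence rule for analyzed_fields (excludes is applied last, so it wins) can be sketched with simple wildcard matching. This is an illustration under assumptions: the field names are hypothetical, the candidate list is taken to contain only fields of supported types, an empty includes list is treated as "all candidates", and Python's fnmatch accepts more wildcard syntax than the `*` patterns shown here.

```python
from fnmatch import fnmatchcase


def select_analyzed_fields(candidate_fields, includes=None, excludes=None):
    """Keep fields that match includes, then drop any that match excludes."""
    includes = includes or ["*"]  # assumption: no includes means all candidates
    excludes = excludes or []
    selected = []
    for field in candidate_fields:
        included = any(fnmatchcase(field, pattern) for pattern in includes)
        excluded = any(fnmatchcase(field, pattern) for pattern in excludes)
        # excludes is evaluated last, so it overrides a matching include.
        if included and not excluded:
            selected.append(field)
    return selected


fields = ["price", "rooms", "area_sqm", "area_code", "id"]
print(select_analyzed_fields(fields, includes=["area_*", "price"], excludes=["area_code"]))
```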

  • description string

    A description of the job.

  • dest object Required

    The destination configuration.

    • index string Required

      Defines the destination index to store the results of the data frame analytics job.

    • results_field string

      Defines the name of the field in which to store the results of the analysis. Defaults to ml.

  • max_num_threads number

    The maximum number of threads to be used by the analysis. Using more threads may decrease the time necessary to complete the analysis at the cost of using more CPU. Note that the process may use additional threads for operational functionality other than the analysis itself.

    Default value is 1.

  • _meta object
    • * object Additional properties
  • model_memory_limit string

    The approximate maximum amount of memory resources that are permitted for analytical processing. If your elasticsearch.yml file contains an xpack.ml.max_model_memory_limit setting, an error occurs when you try to create data frame analytics jobs that have model_memory_limit values greater than that setting.

    Default value is 1gb.
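
A client can compare model_memory_limit with the cluster's xpack.ml.max_model_memory_limit before creating the job. The sketch below parses byte size strings such as 1gb using binary multiples (1kb = 1024 bytes). The function names are hypothetical, and only whole-number values are accepted.

```python
import re

# Byte size units as binary multiples.
_UNITS = {
    "b": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
    "tb": 1024 ** 4,
    "pb": 1024 ** 5,
}


def parse_byte_size(text: str) -> int:
    """Convert a string such as '512mb' or '1gb' to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*(b|kb|mb|gb|tb|pb)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized byte size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def exceeds_cluster_limit(model_memory_limit: str, max_model_memory_limit: str) -> bool:
    """True if the job's limit is above the cluster-wide maximum."""
    return parse_byte_size(model_memory_limit) > parse_byte_size(max_model_memory_limit)


print(exceeds_cluster_limit("2gb", "1gb"))
print(exceeds_cluster_limit("512mb", "1gb"))
```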

  • source object Required

    The configuration of how to source the analysis data.

    • index string | array[string] Required

      Index or indices on which to perform the analysis. It can be a single index or index pattern as well as an array of indices or patterns. NOTE: If your source indices contain documents with the same IDs, only the document that is indexed last appears in the destination index.

    • query object

      The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following value: {"match_all": {}}.

      • bool object
      • boosting object
      • common object Deprecated
      • combined_fields object
      • constant_score object
      • dis_max object
      • distance_feature
      • exists object
      • function_score object
      • fuzzy object

        Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance.

      • geo_bounding_box object
      • geo_distance object
      • geo_grid object

        Matches geo_point and geo_shape values that intersect a grid cell from a GeoGrid aggregation.

      • geo_polygon object
      • geo_shape object
      • has_child object
      • has_parent object
      • ids object
      • intervals object

        Returns documents based on the order and proximity of matching terms.

      • knn object
      • match object

        Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching.

      • match_all object
      • match_bool_prefix object

        Analyzes its input and constructs a bool query from the terms. Each term except the last is used in a term query. The last term is used in a prefix query.

      • match_none object
      • match_phrase object

        Analyzes the text and creates a phrase query out of the analyzed text.

      • match_phrase_prefix object

        Returns documents that contain the words of a provided text, in the same order as provided. The last term of the provided text is treated as a prefix, matching any words that begin with that term.

      • more_like_this object
      • multi_match object
      • nested object
      • parent_id object
      • percolate object
      • prefix object

        Returns documents that contain a specific prefix in a provided field.

      • query_string object
      • range object

        Returns documents that contain terms within a provided range.

      • rank_feature object
      • regexp object

        Returns documents that contain terms matching a regular expression.

      • rule object
      • script object
      • script_score object
      • semantic object
      • shape object
      • simple_query_string object
      • span_containing object
      • span_field_masking object
      • span_first object
      • span_multi object
      • span_near object
      • span_not object
      • span_or object
      • span_term object

        Matches spans containing a term.

      • span_within object
      • term object

        Returns documents that contain an exact term in a provided field. To return a document, the query term must exactly match the queried field's value, including whitespace and capitalization.

      • terms object
      • terms_set object

        Returns documents that contain a minimum number of exact terms in a provided field. To return a document, a required number of terms must exactly match the field values, including whitespace and capitalization.

      • text_expansion object Deprecated Generally available; Added in 8.8.0

        Uses a natural language processing model to convert the query text into a list of token-weight pairs which are then used in a query against a sparse vector or rank features field.

      • weighted_tokens object Deprecated Generally available; Added in 8.13.0

        Supports returning text_expansion query results by sending in precomputed tokens with the query.

      • wildcard object

        Returns documents that contain terms matching a wildcard pattern.

      • wrapper object
      • type object
    • runtime_mappings object

      Definitions of runtime fields that will become part of the mapping of the destination index.

      • * object Additional properties
        • fields object

          For type composite

          • * object Additional properties
        • fetch_fields array[object]

          For type lookup

        • format string

          A custom format for date type runtime fields.

        • input_field string

          For type lookup

        • target_field string

          For type lookup

        • target_index string

          For type lookup

        • script object

          Painless script executed at query time.

        • type string Required

          Field type. Values are boolean, composite, date, double, geo_point, geo_shape, ip, keyword, long, or lookup.

    • _source object

      Specify includes and/or excludes patterns to select which fields will be present in the destination. Fields that are excluded cannot be included in the analysis.

      • includes array[string]

        An array of strings that defines the fields that will be included in the analysis.

      • excludes array[string]

        An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes; these fields are excluded from the analysis automatically.
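
To show how the source attributes fit together, the following sketch builds a source object that combines several indices, a query, a runtime field, and a _source filter. All index and field names are hypothetical, and the Painless script assumes that both referenced fields are present in every document.

```python
import json

source = {
    # A single index, an index pattern, or an array of either is accepted.
    "index": ["houses-2023", "houses-2024"],  # hypothetical indices
    # Passed verbatim to Elasticsearch; omitted means {"match_all": {}}.
    "query": {"range": {"price": {"gte": 0}}},
    # Runtime field computed at query time; it becomes part of the
    # destination index mapping.
    "runtime_mappings": {
        "price_per_sqm": {
            "type": "double",
            "script": {
                "source": "emit(doc['price'].value / doc['area_sqm'].value)"
            },
        }
    },
    # Fields excluded here are absent from the destination index and
    # cannot take part in the analysis.
    "_source": {"excludes": ["internal_notes"]},
}

print(json.dumps(source, indent=2))
```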

  • headers object
  • version string

Responses

  • 200 application/json
    • authorization object
      • api_key object

        If an API key was used for the most recent update to the job, its name and identifier are listed in the response.

        • id string Required

          The identifier for the API key.

        • name string Required

          The name of the API key.

      • roles array[string]

        If a user ID was used for the most recent update to the job, its roles at the time of the update are listed in the response.

      • service_account string

        If a service account was used for the most recent update to the job, the account name is listed in the response.

    • allow_lazy_start boolean Required
    • analysis object Required
      • classification object

        The configuration information necessary to perform classification.

        • alpha number

          Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.

        • dependent_variable string Required

          Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (integer, short, long, byte), categorical (ip or keyword), or boolean. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric.

        • downsample_factor number

          Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.

        • early_stopping_enabled boolean

          Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.

          Default value is true.

        • eta number

          Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.

        • eta_growth_rate_per_tree number

          Advanced configuration option. Specifies the rate at which eta increases for each new tree that is added to the forest. For example, a rate of 1.05 increases eta by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2.

        • feature_bag_fraction number

          Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.

        • feature_processors array[object]

          Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple feature_processors entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.

        • gamma number

          Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

        • lambda number

          Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

        • max_optimization_rounds_per_hyperparameter number

          Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.

        • max_trees number

          Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.

        • num_top_feature_importance_values number

          Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.

          Default value is 0.

        • randomize_seed number

          Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as source and analyzed_fields are the same).

        • soft_tree_depth_limit number

          Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the soft_tree_depth_tolerance to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.

        • soft_tree_depth_tolerance number

          Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds soft_tree_depth_limit. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01.

        • class_assignment_objective string
        • num_top_classes number

          Defines the number of categories for which the predicted probabilities are reported. It must be non-negative or -1. If it is -1 or greater than the total number of categories, probabilities are reported for all categories; if you have a large number of categories, there could be a significant effect on the size of your destination index. NOTE: To use the AUC ROC evaluation method, num_top_classes must be set to -1 or a value greater than or equal to the total number of categories.

          Default value is 2.
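
The num_top_classes rules can be read as a small decision function. The following Python sketch is only an illustration of the description above, not server code; the function names and the 5-class example are hypothetical.

```python
def reported_class_count(num_top_classes: int, total_classes: int) -> int:
    """Number of classes whose predicted probabilities are reported per document."""
    if num_top_classes < -1:
        raise ValueError("num_top_classes must be non-negative or -1")
    if num_top_classes == -1 or num_top_classes > total_classes:
        return total_classes  # probabilities are reported for all categories
    return num_top_classes


def supports_auc_roc(num_top_classes: int, total_classes: int) -> bool:
    """True when the setting meets the AUC ROC requirement stated in the note."""
    return num_top_classes == -1 or num_top_classes >= total_classes


# Hypothetical dependent variable with 5 distinct classes.
print(reported_class_count(2, 5))   # default setting of 2
print(supports_auc_roc(2, 5))       # the default does not cover all 5 classes
print(reported_class_count(-1, 5))  # -1 reports every class
```

With many classes, reporting all of them increases the size of each destination document, which is the trade-off the description warns about.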

      • outlier_detection object

        The configuration information necessary to perform outlier detection. NOTE: Advanced parameters are for fine-tuning classification analysis. They are set automatically by hyperparameter optimization to give the minimum validation error. It is highly recommended to use the default values unless you fully understand the function of these parameters.

        • compute_feature_influence boolean

          Specifies whether the feature influence calculation is enabled.

          Default value is true.

        • feature_influence_threshold number

          The minimum outlier score that a document needs to have in order to calculate its feature influence score. Value range: 0-1.

          Default value is 0.1.

        • method string

          The method that outlier detection uses. Available methods are lof, ldof, distance_kth_nn, distance_knn, and ensemble. The ensemble method runs several different methods, then normalizes and combines their individual outlier scores to obtain the overall outlier score.

          Default value is ensemble.

        • n_neighbors number

          Defines the value for how many nearest neighbors each method of outlier detection uses to calculate its outlier score. When the value is not set, different values are used for different ensemble members. This default behavior helps improve the diversity in the ensemble; only override it if you are confident that the value you choose is appropriate for the data set.

        • outlier_fraction number

          The proportion of the data set that is assumed to be outlying prior to outlier detection. For example, 0.05 means it is assumed that 5% of values are real outliers and 95% are inliers.

        • standardization_enabled boolean

          If true, the following operation is performed on the columns before computing outlier scores: (x_i - mean(x_i)) / sd(x_i).

          Default value is true.
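
The standardization formula (x_i - mean(x_i)) / sd(x_i) can be illustrated on a single column. This is a minimal sketch, assuming the population standard deviation; the description does not say whether the population or sample form is used, and the input values are made up.

```python
from statistics import fmean, pstdev


def standardize(column: list[float]) -> list[float]:
    """Apply (x_i - mean(x_i)) / sd(x_i) to one column of values."""
    mean = fmean(column)
    sd = pstdev(column)  # assumption: population standard deviation
    if sd == 0:
        # A constant column has no spread; returning zeros is a choice made
        # for this sketch, not documented behavior.
        return [0.0 for _ in column]
    return [(x - mean) / sd for x in column]


distances = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up sample values
scaled = standardize(distances)
print(scaled)
```

After standardization every column has mean 0 and unit spread, so no single field dominates the distance computations because of its scale.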

      • regression object

        The configuration information necessary to perform regression. NOTE: Advanced parameters are for fine-tuning regression analysis. They are set automatically by hyperparameter optimization to give the minimum validation error. It is highly recommended to use the default values unless you fully understand the function of these parameters.

        • alpha number

          Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.

        • dependent_variable string Required

          Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (integer, short, long, byte), categorical (ip or keyword), or boolean. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric.

        • downsample_factor number

          Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.

        • early_stopping_enabled boolean

          Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.

          Default value is true.

        • eta number

          Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.

        • eta_growth_rate_per_tree number

          Advanced configuration option. Specifies the rate at which eta increases for each new tree that is added to the forest. For example, a rate of 1.05 increases eta by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2.

        • feature_bag_fraction number

          Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.

        • feature_processors array[object]

          Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple feature_processors entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.

        • gamma number

          Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

        • lambda number

          Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

        • max_optimization_rounds_per_hyperparameter number

          Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.

        • max_trees number

          Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.

        • num_top_feature_importance_values number

          Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.

          Default value is 0.

        • randomize_seed number

          Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as source and analyzed_fields are the same).

        • soft_tree_depth_limit number

          Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the soft_tree_depth_tolerance to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.

        • soft_tree_depth_tolerance number

          Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds soft_tree_depth_limit. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01.

        • loss_function string

          The loss function used during regression. Available options are mse (mean squared error), msle (mean squared logarithmic error), and huber (Pseudo-Huber loss).

          Default value is mse.

        • loss_function_parameter number

          A positive number that is used as a parameter to the loss_function.
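
The numeric limits quoted for the tree hyperparameters can be checked on the client before a job is submitted. The Python sketch below is a hypothetical helper, not part of any Elasticsearch client. It assumes that "between" ranges include their endpoints, and that the eta growth rate compounds per tree, which the 5% example suggests but does not state outright.

```python
# Ranges copied from the parameter descriptions: (low, high, low_inclusive).
# None means the text states no bound on that side.
RANGES = {
    "alpha": (0.0, None, True),
    "downsample_factor": (0.0, 1.0, False),
    "eta": (0.001, 1.0, True),                   # assumption: endpoints included
    "eta_growth_rate_per_tree": (0.5, 2.0, True),  # assumption: endpoints included
    "gamma": (0.0, None, True),
    "lambda": (0.0, None, True),
    "max_trees": (None, 2000, True),
    "soft_tree_depth_limit": (0.0, None, True),
    "soft_tree_depth_tolerance": (0.01, None, True),
}


def out_of_range(params: dict) -> list[str]:
    """Return the names of supplied hyperparameters that violate a listed range."""
    bad = []
    for name, value in params.items():
        if name not in RANGES:
            continue  # no range is documented, for example feature_bag_fraction
        low, high, low_inclusive = RANGES[name]
        if low is not None and (value < low or (value == low and not low_inclusive)):
            bad.append(name)
        elif high is not None and value > high:
            bad.append(name)
    return bad


def max_optimization_steps(undefined_hyperparameters: int, rounds_per_hyperparameter: int) -> int:
    """Upper bound on optimization steps: undefined hyperparameters times rounds each."""
    return undefined_hyperparameters * rounds_per_hyperparameter


def eta_after(eta: float, growth_rate: float, extra_trees: int) -> float:
    """eta after some extra trees, assuming the growth rate compounds per tree."""
    return eta * growth_rate ** extra_trees


print(out_of_range({"eta": 0.05, "downsample_factor": 0.0, "max_trees": 2500}))
print(max_optimization_steps(3, 2))
print(eta_after(0.1, 1.05, 1))
```

Any hyperparameter left out of the request stays undefined, so it counts toward the first argument of max_optimization_steps.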

    • analyzed_fields object
      • includes array[string]

        An array of strings that defines the fields that will be included in the analysis.

      • excludes array[string]

        An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes; these fields are excluded from the analysis automatically.
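
Field selection can be sketched as a filter over the candidate field names. The Python below is a hypothetical illustration that rests on several assumptions: includes selects fields, excludes removes them and is applied last, an empty includes list selects every field (as in the request example, which pairs an empty includes with a non-empty excludes), and patterns are simple wildcards approximated with fnmatch.

```python
from fnmatch import fnmatchcase


def select_fields(fields: list[str], includes: list[str], excludes: list[str]) -> list[str]:
    """Return the fields kept for analysis, preserving their original order."""
    kept = []
    for field in fields:
        # Assumption: an empty includes list selects every field.
        included = not includes or any(fnmatchcase(field, p) for p in includes)
        # Assumption: excludes are applied last, so they override includes.
        excluded = any(fnmatchcase(field, p) for p in excludes)
        if included and not excluded:
            kept.append(field)
    return kept


# Field names taken from the flights sample data used in the request example.
print(select_fields(["FlightNum", "FlightDelayMin", "Carrier"], [], ["FlightNum"]))
```

Fields with unsupported data types are dropped by the server regardless of these lists; the sketch does not model that step.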

    • create_time number

      The time the job was created, in milliseconds since the Unix epoch.

    • description string
    • dest object Required
      • index string Required

        Defines the destination index to store the results of the data frame analytics job.

      • results_field string

        Defines the name of the field in which to store the results of the analysis. Defaults to ml.

    • id string Required
    • max_num_threads number Required
    • _meta object
      • * object Additional properties
    • model_memory_limit string Required
    • source object Required
      • index string | array[string] Required

        Index or indices on which to perform the analysis. It can be a single index or index pattern as well as an array of indices or patterns. NOTE: If your source indices contain documents with the same IDs, only the document that is indexed last appears in the destination index.

      • query object

        The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following value: {"match_all": {}}.

        • common object Deprecated
        • distance_feature
        • fuzzy object

          Returns documents that contain terms similar to the search term, as measured by a Levenshtein edit distance.

        • geo_grid object

          Matches geo_point and geo_shape values that intersect a grid cell from a GeoGrid aggregation.

        • intervals object

          Returns documents based on the order and proximity of matching terms.

        • match object

          Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching.

        • match_bool_prefix object

          Analyzes its input and constructs a bool query from the terms. Each term except the last is used in a term query. The last term is used in a prefix query.

        • match_phrase object

          Analyzes the text and creates a phrase query out of the analyzed text.

        • match_phrase_prefix object

          Returns documents that contain the words of a provided text, in the same order as provided. The last term of the provided text is treated as a prefix, matching any words that begin with that term.

        • prefix object

          Returns documents that contain a specific prefix in a provided field.

        • range object

          Returns documents that contain terms within a provided range.

        • regexp object

          Returns documents that contain terms matching a regular expression.

        • span_term object

          Matches spans containing a term.

        • term object

          Returns documents that contain an exact term in a provided field. To return a document, the query term must exactly match the queried field's value, including whitespace and capitalization.

        • terms_set object

          Returns documents that contain a minimum number of exact terms in a provided field. To return a document, a required number of terms must exactly match the field values, including whitespace and capitalization.

        • text_expansion object Deprecated Generally available; Added in 8.8.0

          Uses a natural language processing model to convert the query text into a list of token-weight pairs which are then used in a query against a sparse vector or rank features field.

        • weighted_tokens object Deprecated Generally available; Added in 8.13.0

          Supports returning text_expansion query results by sending in precomputed tokens with the query.

        • wildcard object

          Returns documents that contain terms matching a wildcard pattern.

      • runtime_mappings object

        Definitions of runtime fields that will become part of the mapping of the destination index.

        • * object Additional properties
          • fields object

            For type composite

          • fetch_fields array[object]

            For type lookup

          • format string

            A custom format for date type runtime fields.

      • _source object

        Specify includes and/or excludes patterns to select which fields will be present in the destination. Fields that are excluded cannot be included in the analysis.

        • includes array[string]

          An array of strings that defines the fields that will be included in the analysis.

        • excludes array[string]

          An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes; these fields are excluded from the analysis automatically.

    • version string Required
PUT /_ml/data_frame/analytics/{id}
curl \
 --request PUT 'http://api.example.com/_ml/data_frame/analytics/{id}' \
 --header "Content-Type: application/json" \
 --data '{"source":{"index":["kibana_sample_data_flights"],"query":{"range":{"DistanceKilometers":{"gt":0}}},"_source":{"includes":[],"excludes":["FlightDelay","FlightDelayType"]}},"dest":{"index":"df-flight-delays","results_field":"ml-results"},"analysis":{"regression":{"dependent_variable":"FlightDelayMin","training_percent":90}},"analyzed_fields":{"includes":[],"excludes":["FlightNum"]},"model_memory_limit":"100mb"}'
Request example
An example body for a `PUT _ml/data_frame/analytics/model-flight-delays-pre` request.
{
  "source": {
    "index": [
      "kibana_sample_data_flights"
    ],
    "query": {
      "range": {
        "DistanceKilometers": {
          "gt": 0
        }
      }
    },
    "_source": {
      "includes": [],
      "excludes": [
        "FlightDelay",
        "FlightDelayType"
      ]
    }
  },
  "dest": {
    "index": "df-flight-delays",
    "results_field": "ml-results"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "FlightDelayMin",
      "training_percent": 90
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": [
      "FlightNum"
    ]
  },
  "model_memory_limit": "100mb"
}
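
The same request body can be assembled in code and serialized once before it is sent. The Python sketch below uses only the standard library and stops short of making the HTTP call. The JOB_ID pattern is one reading of the id rules in the path parameter description, and the helper name is hypothetical.

```python
import json
import re

# Pattern derived from the id rules: lowercase alphanumerics, hyphens and
# underscores, starting and ending with an alphanumeric character.
JOB_ID = re.compile(r"[a-z0-9](?:[a-z0-9_-]*[a-z0-9])?")


def is_valid_job_id(job_id: str) -> bool:
    """Check a job identifier against the documented character rules."""
    return JOB_ID.fullmatch(job_id) is not None


body = {
    "source": {
        "index": ["kibana_sample_data_flights"],
        "query": {"range": {"DistanceKilometers": {"gt": 0}}},
        "_source": {"includes": [], "excludes": ["FlightDelay", "FlightDelayType"]},
    },
    "dest": {"index": "df-flight-delays", "results_field": "ml-results"},
    "analysis": {
        "regression": {"dependent_variable": "FlightDelayMin", "training_percent": 90}
    },
    "analyzed_fields": {"includes": [], "excludes": ["FlightNum"]},
    "model_memory_limit": "100mb",
}

payload = json.dumps(body)            # encode once: the body is a JSON object
double_encoded = json.dumps(payload)  # encoding twice yields a JSON string instead

print(is_valid_job_id("model-flight-delays-pre"))
print(type(json.loads(payload)).__name__)
print(type(json.loads(double_encoded)).__name__)
```

The payload string is what belongs in the request body, with Content-Type set to application/json. The double-encoded form is shown only to make the failure mode visible: it parses as a string rather than an object.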