WARNING: Deprecated in 7.15.0.

The Java REST Client is deprecated in favor of the Java API Client.

Create data frame analytics jobs API

IMPORTANT: This documentation is no longer updated. Refer to Elastic's version policy and the latest documentation.

Creates a new data frame analytics job. The API accepts a PutDataFrameAnalyticsRequest object as a request and returns a PutDataFrameAnalyticsResponse.

Request

A PutDataFrameAnalyticsRequest requires the following argument:

PutDataFrameAnalyticsRequest request = new PutDataFrameAnalyticsRequest(config);

The configuration of the data frame analytics job to create

Data frame analytics configuration

The DataFrameAnalyticsConfig object holds the full configuration of the data frame analytics job and is built from the following arguments:

DataFrameAnalyticsConfig config = DataFrameAnalyticsConfig.builder()
    .setId("my-analytics-config") 
    .setSource(sourceConfig) 
    .setDest(destConfig) 
    .setAnalysis(outlierDetection) 
    .setAnalyzedFields(analyzedFields) 
    .setModelMemoryLimit(new ByteSizeValue(5, ByteSizeUnit.MB)) 
    .setDescription("this is an example description") 
    .setMaxNumThreads(1) 
    .build();

	The data frame analytics job ID
	The source index and query from which to gather data
	The destination index
	The analysis to be performed
	The fields to be included in / excluded from the analysis
	The memory limit for the model created as part of the analysis process
	Optionally, a human-readable description
	The maximum number of threads to be used by the analysis. Defaults to 1.

SourceConfig

The index and the query from which to collect data.

DataFrameAnalyticsSource sourceConfig = DataFrameAnalyticsSource.builder() 
    .setIndex("put-test-source-index") 
    .setQueryConfig(queryConfig) 
    .setRuntimeMappings(runtimeMappings) 
    .setSourceFiltering(new FetchSourceContext(true,
        new String[] { "included_field_1", "included_field_2" },
        new String[] { "excluded_field" })) 
    .build();

	Constructing a new DataFrameAnalyticsSource
	The source index
	The query from which to gather the data. If query is not set, a `match_all` query is used by default.
	Runtime mappings that will be added to the destination index mapping.
	Source filtering to select which fields will exist in the destination index.

QueryConfig

The query with which to select data from the source.

QueryConfig queryConfig = new QueryConfig(new MatchAllQueryBuilder());

DestinationConfig

The index to which data should be written by the data frame analytics job.

DataFrameAnalyticsDest destConfig = DataFrameAnalyticsDest.builder() 
    .setIndex("put-test-dest-index") 
    .build();

	Constructing a new DataFrameAnalyticsDest
	The destination index

Analysis

The analysis to be performed. The supported analyses are OutlierDetection, Classification, and Regression.

Outlier detection

OutlierDetection analysis can be created in one of two ways:

DataFrameAnalysis outlierDetection = org.elasticsearch.client.ml.dataframe.OutlierDetection.createDefault();

Constructing a new OutlierDetection object with the default strategy to determine outliers

DataFrameAnalysis outlierDetectionCustomized = org.elasticsearch.client.ml.dataframe.OutlierDetection.builder() 
    .setMethod(org.elasticsearch.client.ml.dataframe.OutlierDetection.Method.DISTANCE_KNN) 
    .setNNeighbors(5) 
    .setFeatureInfluenceThreshold(0.1) 
    .setComputeFeatureInfluence(true) 
    .setOutlierFraction(0.05) 
    .setStandardizationEnabled(true) 
    .build();

	Constructing a new OutlierDetection object
	The method used to perform the analysis
	Number of neighbors taken into account during analysis
	The minimum `outlier_score` required to compute feature influence
	Whether to compute feature influence
	The proportion of the data set that is assumed to be outlying prior to outlier detection
	Whether to apply standardization to feature values

Classification

Classification analysis requires the dependent_variable to be set and accepts a number of optional parameters:

DataFrameAnalysis classification = Classification.builder("my_dependent_variable") 
    .setLambda(1.0) 
    .setGamma(5.5) 
    .setEta(0.5) 
    .setMaxTrees(50) 
    .setFeatureBagFraction(0.4) 
    .setNumTopFeatureImportanceValues(3) 
    .setPredictionFieldName("my_prediction_field_name") 
    .setTrainingPercent(50.0) 
    .setRandomizeSeed(1234L) 
    .setClassAssignmentObjective(Classification.ClassAssignmentObjective.MAXIMIZE_ACCURACY) 
    .setNumTopClasses(1) 
    .setFeatureProcessors(Arrays.asList(OneHotEncoding.builder("categorical_feature") 
        .addOneHot("cat", "cat_column")
        .build()))
    .setAlpha(1.0) 
    .setEtaGrowthRatePerTree(1.0) 
    .setSoftTreeDepthLimit(1.0) 
    .setSoftTreeDepthTolerance(1.0) 
    .setDownsampleFactor(0.5) 
    .setMaxOptimizationRoundsPerHyperparameter(3) 
    .setEarlyStoppingEnabled(true) 
    .build();

	Constructing a new Classification builder object with the required dependent variable
	The lambda regularization parameter. A non-negative double.
	The gamma regularization parameter. A non-negative double.
	The applied shrinkage. A double in [0.001, 1].
	The maximum number of trees the forest is allowed to contain. An integer in [1, 2000].
	The fraction of features which will be used when selecting a random bag for each candidate split. A double in (0, 1].
	If set, feature importance for the top most important features will be computed.
	The name of the prediction field in the results object.
	The percentage of training-eligible rows to be used in training. Defaults to 100%.
	The seed to be used by the random generator that picks which rows are used in training.
	The optimization objective to target when assigning class labels. Defaults to maximize_minimum_recall.
	The number of top classes (or -1 which denotes all classes) to be reported in the results. Defaults to 2.
	Custom feature processors that will create new features for analysis from the included document fields. Note, automatic categorical feature encoding still occurs for all features.
	The alpha regularization parameter. A non-negative double.
	The growth rate of the shrinkage parameter. A double in [0.5, 2.0].
	The soft tree depth limit. A non-negative double.
	The soft tree depth tolerance. Controls how much the soft tree depth limit is respected. A double greater than or equal to 0.01.
	The amount by which to downsample the data for stochastic gradient estimates. A double in (0, 1.0].
	The maximum number of optimization rounds used per hyperparameter during hyperparameter optimization. An integer in [0, 20].
	Whether to enable early stopping, which finishes the training process when it is no longer finding better models.
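
The documented ranges can be checked before a request is sent. The following is a minimal sketch, not part of the client: `HyperparameterRanges` and its `validate` method are hypothetical names, and the bounds are copied from the parameter descriptions above. It covers only a subset of the parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Elasticsearch client: checks a few
// boosted-tree hyperparameters against the ranges documented above.
final class HyperparameterRanges {

    private HyperparameterRanges() {}

    /** Returns human-readable violations; an empty list means every value is in range. */
    static List<String> validate(double lambda, double gamma, double eta, int maxTrees,
                                 double featureBagFraction, double downsampleFactor) {
        List<String> errors = new ArrayList<>();
        // The negated comparisons also reject NaN.
        if (!(lambda >= 0)) errors.add("lambda must be non-negative");
        if (!(gamma >= 0)) errors.add("gamma must be non-negative");
        if (!(eta >= 0.001 && eta <= 1.0)) errors.add("eta must be in [0.001, 1]");
        if (maxTrees < 1 || maxTrees > 2000) errors.add("max_trees must be in [1, 2000]");
        if (!(featureBagFraction > 0 && featureBagFraction <= 1.0)) {
            errors.add("feature_bag_fraction must be in (0, 1]");
        }
        if (!(downsampleFactor > 0 && downsampleFactor <= 1.0)) {
            errors.add("downsample_factor must be in (0, 1]");
        }
        return errors;
    }

    public static void main(String[] args) {
        // An eta of 5.5 lies outside [0.001, 1], so a violation is reported.
        System.out.println(validate(1.0, 5.5, 5.5, 50, 0.4, 0.5));
        // With eta at 0.5 every checked value is in range.
        System.out.println(validate(1.0, 5.5, 0.5, 50, 0.4, 0.5));
    }
}
```

A check like this only catches obvious mistakes early. The server remains the authority on which values are accepted.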

Regression

Regression analysis requires the dependent_variable to be set and accepts a number of optional parameters:

DataFrameAnalysis regression = org.elasticsearch.client.ml.dataframe.Regression.builder("my_dependent_variable") 
    .setLambda(1.0) 
    .setGamma(5.5) 
    .setEta(0.5) 
    .setMaxTrees(50) 
    .setFeatureBagFraction(0.4) 
    .setNumTopFeatureImportanceValues(3) 
    .setPredictionFieldName("my_prediction_field_name") 
    .setTrainingPercent(50.0) 
    .setRandomizeSeed(1234L) 
    .setLossFunction(Regression.LossFunction.MSE) 
    .setLossFunctionParameter(1.0) 
    .setFeatureProcessors(Arrays.asList(OneHotEncoding.builder("categorical_feature") 
        .addOneHot("cat", "cat_column")
        .build()))
    .setAlpha(1.0) 
    .setEtaGrowthRatePerTree(1.0) 
    .setSoftTreeDepthLimit(1.0) 
    .setSoftTreeDepthTolerance(1.0) 
    .setDownsampleFactor(0.5) 
    .setMaxOptimizationRoundsPerHyperparameter(3) 
    .setEarlyStoppingEnabled(true) 
    .build();

	Constructing a new Regression builder object with the required dependent variable
	The lambda regularization parameter. A non-negative double.
	The gamma regularization parameter. A non-negative double.
	The applied shrinkage. A double in [0.001, 1].
	The maximum number of trees the forest is allowed to contain. An integer in [1, 2000].
	The fraction of features which will be used when selecting a random bag for each candidate split. A double in (0, 1].
	If set, feature importance for the top most important features will be computed.
	The name of the prediction field in the results object.
	The percentage of training-eligible rows to be used in training. Defaults to 100%.
	The seed to be used by the random generator that picks which rows are used in training.
	The loss function used for regression. Defaults to `mse`.
	An optional parameter to the loss function.
	Custom feature processors that will create new features for analysis from the included document fields. Note, automatic categorical feature encoding still occurs for all features.
	The alpha regularization parameter. A non-negative double.
	The growth rate of the shrinkage parameter. A double in [0.5, 2.0].
	The soft tree depth limit. A non-negative double.
	The soft tree depth tolerance. Controls how much the soft tree depth limit is respected. A double greater than or equal to 0.01.
	The amount by which to downsample the data for stochastic gradient estimates. A double in (0, 1.0].
	The maximum number of optimization rounds used per hyperparameter during hyperparameter optimization. An integer in [0, 20].
	Whether to enable early stopping, which finishes the training process when it is no longer finding better models.

Analyzed fields

A FetchSourceContext object containing the fields to be included in or excluded from the analysis:

FetchSourceContext analyzedFields =
    new FetchSourceContext(
        true,
        new String[] { "included_field_1", "included_field_2" },
        new String[] { "excluded_field" });
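
As a rough illustration of how an include/exclude pair narrows a set of fields, the sketch below filters a list of candidate field names. It is a simplification built on assumptions: `AnalyzedFieldsSketch` is a hypothetical class, it matches exact names only, it treats an empty include list as "all fields", and it assumes that an exclude wins over an include. The actual field selection is performed by Elasticsearch, not by the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; the real selection happens server-side.
final class AnalyzedFieldsSketch {

    private AnalyzedFieldsSketch() {}

    /** Keeps candidates that are included (or all, if includes is empty) and not excluded. */
    static List<String> select(List<String> candidates, String[] includes, String[] excludes) {
        Set<String> in = new HashSet<>(Arrays.asList(includes));
        Set<String> out = new HashSet<>(Arrays.asList(excludes));
        List<String> selected = new ArrayList<>();
        for (String field : candidates) {
            boolean included = in.isEmpty() || in.contains(field);
            // Assumption: an exclude always overrides an include.
            if (included && !out.contains(field)) {
                selected.add(field);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> candidates =
            Arrays.asList("included_field_1", "included_field_2", "excluded_field", "other_field");
        System.out.println(select(candidates,
            new String[] { "included_field_1", "included_field_2" },
            new String[] { "excluded_field" }));
    }
}
```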

Synchronous execution

When executing a PutDataFrameAnalyticsRequest in the following manner, the client waits for the PutDataFrameAnalyticsResponse to be returned before continuing with code execution:

PutDataFrameAnalyticsResponse response = client.machineLearning().putDataFrameAnalytics(request, RequestOptions.DEFAULT);

Synchronous calls may throw an IOException if the high-level REST client fails to parse the REST response, if the request times out, or in similar cases where no response comes back from the server.

In cases where the server returns a 4xx or 5xx error code, the high-level client tries to parse the response body error details instead and then throws a generic ElasticsearchException and adds the original ResponseException as a suppressed exception to it.
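
The suppressed exception can be retrieved with the standard `Throwable.getSuppressed()` method. The sketch below shows the pattern with JDK types only: a plain `RuntimeException` stands in for ElasticsearchException, an `IOException` stands in for ResponseException, and `SuppressedCauseSketch` is a hypothetical helper rather than client code.

```java
import java.io.IOException;

// Hypothetical helper using JDK stand-ins for the client's exception types.
final class SuppressedCauseSketch {

    private SuppressedCauseSketch() {}

    // Mimics the described behaviour: a generic unchecked exception that
    // carries the original low-level exception as a suppressed exception.
    static RuntimeException wrap(IOException lowLevel, String parsedReason) {
        RuntimeException generic = new RuntimeException(parsedReason);
        generic.addSuppressed(lowLevel);
        return generic;
    }

    // Returns the first suppressed exception of the given type, or null if there is none.
    static <T extends Throwable> T firstSuppressed(Throwable e, Class<T> type) {
        for (Throwable suppressed : e.getSuppressed()) {
            if (type.isInstance(suppressed)) {
                return type.cast(suppressed);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RuntimeException e = wrap(new IOException("HTTP/1.1 400 Bad Request"), "validation failed");
        IOException original = firstSuppressed(e, IOException.class);
        System.out.println(e.getMessage() + " (original: " + original.getMessage() + ")");
    }
}
```

With the real client, the same lookup would be applied inside a `catch` block for the generic exception, passing the client's own exception class as the type.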

Asynchronous execution

Executing a PutDataFrameAnalyticsRequest can also be done in an asynchronous fashion so that the client can return directly. Users need to specify how the response or potential failures will be handled by passing the request and a listener to the asynchronous put-data-frame-analytics method:

client.machineLearning().putDataFrameAnalyticsAsync(request, RequestOptions.DEFAULT, listener);

The PutDataFrameAnalyticsRequest to execute and the ActionListener to use when the execution completes

The asynchronous method does not block and returns immediately. Once the request completes, the ActionListener is called back through its onResponse method if the execution succeeded or through its onFailure method if it failed. Failure scenarios and expected exceptions are the same as in the synchronous execution case.

A typical listener for put-data-frame-analytics looks like:

ActionListener<PutDataFrameAnalyticsResponse> listener = new ActionListener<PutDataFrameAnalyticsResponse>() {
    @Override
    public void onResponse(PutDataFrameAnalyticsResponse response) {
        
    }

    @Override
    public void onFailure(Exception e) {
        
    }
};

	Called when the execution is successfully completed.
	Called when the whole `PutDataFrameAnalyticsRequest` fails.
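
If a future-based style is preferred, a listener can forward both callbacks to a `CompletableFuture`. This is a sketch using JDK types only: the nested `Listener` interface is a stand-in that mirrors the two ActionListener callbacks shown above, and `ListenerFutureSketch` is a hypothetical name. With the real client, the same two method bodies would go inside an `ActionListener<PutDataFrameAnalyticsResponse>`.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: bridges a two-callback listener to a CompletableFuture.
final class ListenerFutureSketch {

    private ListenerFutureSketch() {}

    // Stand-in with the same two callbacks as the client's ActionListener.
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Returns a listener that completes the given future from either callback.
    static <T> Listener<T> completing(final CompletableFuture<T> future) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                future.complete(response);
            }

            @Override
            public void onFailure(Exception e) {
                future.completeExceptionally(e);
            }
        };
    }

    public static void main(String[] args) {
        CompletableFuture<String> future = new CompletableFuture<>();
        Listener<String> listener = completing(future);
        // Simulates the client invoking the success callback.
        listener.onResponse("created");
        System.out.println(future.join());
    }
}
```

The caller can then chain further steps on the future or block on it, while the asynchronous client call itself still returns immediately.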

Response

The returned PutDataFrameAnalyticsResponse contains the newly created data frame analytics job.

DataFrameAnalyticsConfig createdConfig = response.getConfig();
