How a data frame analytics job worksedit

A data frame analytics job is essentially a persistent Elasticsearch task. During its life cycle, it goes through four phases:

  • reindexing,
  • loading data,
  • analyzing,
  • writing results.

Let’s take a look at the phases one-by-one.

Reindexingedit

During the reindexing phase the documents from the source index or indices are copied to the destination index. If you want to define settings or mappings, create the index before you start the job. Otherwise, the job creates it using default settings.

Once the destination index is built, the data frame analytics job task calls the Elasticsearch Reindex API to launch the reindexing task.

Loading dataedit

After the reindexing is finished, the job fetches the needed data from the destination index. It converts the data into the format that the analysis process expects, then sends it to the analysis process.

Analyzingedit

When all the required data is loaded, the analyzing phase begins. The exact process always depends on the type of data frame analytics – for example, classification or outlier detection – that is executed. This is the phase when the job generates a machine learning model for analyzing the dataset.

Writing resultsedit

After the loaded data is analyzed, the analysis process sends back the results. Only the additional fields that the analysis calculated are written back, the ones that have been loaded in the loading data phase are not. The data frame analytics job matches the results with the data rows in the destination index, merges them, and indexes them back to the destination index. When the process is complete, the task is marked as completed and the data frame analytics job stops. Your data is ready to be evaluated.

Check the Concepts section if you’d like to know more about the various data frame analytics types.

Check the Evaluating data frame analytics section if you are interested in the evaluation of the data frame analytics results.