Working with data frame analytics at scale
A data frame analytics job has numerous configuration options. Some of them may have a significant effect on the time taken to train a model. The training time depends on various factors, such as the statistical characteristics of your data, the number of provided hyperparameters, the number of features included in the analysis, the hardware you use, and so on. This guide contains a list of considerations to help you plan for training data frame analytics models at scale and for optimizing training time.
In this guide, you’ll learn how to:
- Understand the impact of configuration options on the time taken to train models for data frame analytics jobs.
Prerequisites:
This guide assumes you’re already familiar with:
- How to create data frame analytics jobs. If not, refer to Overview.
- How data frame analytics jobs work. If not, refer to How data frame analytics jobs work.
It is important to note that there is a correlation between the training time, the complexity of the model, the size of the data, and the quality of the analysis results. Improvements in quality, however, are not linear with the amount of training data; for very large source data, it might take hours to train a model for very small gains in quality. When you work at scale with data frame analytics, you need to decide what quality of results is acceptable for your use case. When you have determined your acceptance criteria, you have a better picture of the factors you can trade off while still achieving your goal.
The following recommendations are not sequential – the numbers just help you navigate between the list items; you can take action on one or more of them in any order.
0. Start small and iterate rapidly
Training is an iterative process. Experiment with different settings and configuration options (including but not limited to hyperparameters and feature importance), then evaluate the results and decide whether they are good enough or need further experimentation.
Every iteration takes time, so it is useful to start with a small set of data so you can iterate rapidly and then build up from there.
1. Set a small training percent
(This step only applies to regression and classification jobs.)
The number of documents used for training a model has an effect on the training time. A higher training percent means a longer training time.
Consider starting with a small percentage of training data so you can complete iterations more quickly. Once you are happy with your configuration, increase the training percent. As a rule of thumb, if you have a data set with more than 100,000 data points, start with a training percent of 5 or 10.
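For example, a minimal regression job configuration with a reduced training percent might look like the following sketch. The job name, index names, and dependent variable are placeholders for illustration; training_percent is the relevant setting:

```console
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "training_percent": 10
    }
  }
}
```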
2. Disable feature importance calculation
(This step only applies to regression and classification jobs.)
Feature importance indicates which fields had the biggest impact on each prediction that is generated by the analysis. Depending on the size of the data set, feature importance can take a long time to compute.
For a shorter runtime, consider disabling feature importance for some or all iterations if you do not require it.
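Feature importance is controlled by the num_top_feature_importance_values option in the regression or classification configuration. It defaults to 0, which disables the calculation; keeping it at 0 during early iterations keeps runtime short, and you can raise it later when you want per-document feature importance. A minimal sketch with placeholder names:

```console
PUT _ml/data_frame/analytics/my-classification-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": {
      "dependent_variable": "category",
      "num_top_feature_importance_values": 0
    }
  }
}
```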
3. Optimize the number of included fields
You can speed up runtime by analyzing only the relevant fields.
By default, all the fields that are supported by the analysis type are included in the analysis. In general, analyzing more fields requires more resources and results in longer training times, including the time taken for automatic feature selection. To reduce training time, consider limiting the scope of the analysis to the fields that contribute to the prediction. You can do this either by excluding irrelevant fields or by including only the relevant ones.
Feature importance can help you determine the fields that contribute most to the prediction. However, as calculating feature importance increases training time, this is a trade-off that can be evaluated during an iterative training process.
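You can limit the analyzed fields with the analyzed_fields object in the job configuration. The following sketch uses hypothetical field names; note that the dependent variable must remain among the included fields:

```console
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": { "dependent_variable": "price" }
  },
  "analyzed_fields": {
    "includes": ["price", "number_of_rooms", "square_meters"],
    "excludes": ["internal_id"]
  }
}
```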
4. Increase the maximum number of threads
You can set the maximum number of threads that are used during the analysis. The default value of max_num_threads is 1. Depending on the characteristics of the data, using more threads may decrease the training time at the cost of increased CPU usage. Note that trying to use more threads than the number of CPU cores has no advantage.
Hyperparameter optimization and calculating feature importance gain the most benefit from the increased number of threads. This can be seen in the coarse_parameter_search, fine_tuning_parameters, and writing_results phases. The rest of the phases are not affected by the number of threads.
To learn more about the individual phases, refer to How data frame analytics jobs work.
If your machine learning nodes run concurrent anomaly detection or data frame analytics jobs, you may want to keep the maximum number of threads set to a low number – for example, the default of 1 – to prevent jobs from competing for resources.
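max_num_threads is a top-level setting in the job configuration. A minimal sketch with placeholder job and index names:

```console
PUT _ml/data_frame/analytics/my-outlier-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "outlier_detection": {}
  },
  "max_num_threads": 4
}
```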
5. Optimize the size of the source index
Even if the training percent is low, reindexing the source index – which is a mandatory step in the job creation process – may take a long time. During reindexing, the documents from the source index or indices are copied to the destination index, so you have a static copy of the analyzed data.
If your data is large and you do not need to test and train on the whole source index or indices, then reduce the cost of reindexing by using a subset of your source data. This can be done by either defining a filter for the source index in the data frame analytics job configuration, or by manually reindexing a subset of this data to use as an alternate source index.
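One way to restrict the analyzed data is to add a query to the source object of the job configuration. The sketch below assumes a hypothetical @timestamp field and keeps only the last two weeks of data:

```console
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": {
    "index": "my-source-index",
    "query": {
      "range": {
        "@timestamp": { "gte": "now-2w" }
      }
    }
  },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": { "dependent_variable": "price" }
  }
}
```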
6. Configure hyperparameters
(This step only applies to regression and classification jobs.)
Hyperparameter optimization is the most complicated mathematical process during model training and may take a long time.
By default, optimized hyperparameter values are chosen automatically. You can reduce the time taken at this step by configuring hyperparameters manually – if you fully understand their purpose and have sensible values for any or all of them. This reduces the computing load and therefore decreases training time.
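As an illustration, hyperparameters such as max_trees, eta, and feature_bag_fraction can be set manually in a regression or classification analysis. The values below are placeholders rather than recommendations; any hyperparameter you do not set is still optimized automatically:

```console
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "max_trees": 100,
      "eta": 0.05,
      "feature_bag_fraction": 0.6
    }
  }
}
```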