How data frame transform checkpoints workedit

Warning

This functionality is in beta and is subject to change. The design and code is less mature than official GA features and is being provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.

Each time a data frame transform examines the source indices and creates or updates the destination index, it generates a checkpoint.

If your data frame transform runs only once, there is logically only one checkpoint. If your data frame transform runs continuously, however, it creates checkpoints as it ingests and transforms new source data.

To create a checkpoint, the continuous data frame transform:

  1. Checks for changes to source indices.

    Using a simple periodic timer, the data frame transform checks for changes to the source indices. This check is done based on the interval defined in the transform’s frequency property.

    If the source indices remain unchanged or if a checkpoint is already in progress then it waits for the next timer.

  2. Identifies which entities have changed.

    The data frame transform searches to see which entities have changed since the last time it checked. The transform’s sync configuration object identifies a time field in the source indices. The transform uses the values in that field to synchronize the source and destination indices.

  3. Updates the destination index (the data frame) with the changed entities.

    The data frame transform applies changes related to either new or changed entities to the destination index. The set of changed entities is paginated. For each page, the data frame transform performs a composite aggregation using a terms query. After all the pages of changes have been applied, the checkpoint is complete.

This checkpoint process involves both search and indexing activity on the cluster. We have attempted to favor control over performance while developing data frame transforms. We decided it was preferable for the data frame transform to take longer to complete, rather than to finish quickly and take precedence in resource consumption. That being said, the cluster still requires enough resources to support both the composite aggregation search and the indexing of its results.

Tip

If the cluster experiences unsuitable performance degradation due to the data frame transform, stop the transform. Consider whether you can apply a source query to the data frame transform to reduce the scope of data it processes. Also consider whether the cluster has sufficient resources in place to support both the composite aggregation search and the indexing of its results.

Error handlingedit

Failures in data frame transforms tend to be related to searching or indexing. To increase the resiliency of data frame transforms, the cursor positions of the aggregated search and the changed entities search are tracked in memory and persisted periodically.

Checkpoint failures can be categorized as follows:

  • Temporary failures: The checkpoint is retried. If 10 consecutive failures occur, the data frame transform has a failed status. For example, this situation might occur when there are shard failures and queries return only partial results.
  • Irrecoverable failures: The data frame transform immediately fails. For example, this situation occurs when the source index is not found.
  • Adjustment failures: The data frame transform retries with adjusted settings. For example, if a parent circuit breaker memory errors occur during the composite aggregation, the transform receives partial results. The aggregated search is retried with a smaller number of buckets. This retry is performed at the interval defined in the transform’s frequency property. If the search is retried to the point where it reaches a minimal number of buckets, an irrecoverable failure occurs.

If the node running the data frame transforms fails, the transform restarts from the most recent persisted cursor position. This recovery process might repeat some of the work the transform had already done, but it ensures data consistency.