Finding outliers in the eCommerce sample data

This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.

The goal of outlier detection is to find the most unusual documents in an index. Let’s try to detect unusual customer behavior in the eCommerce sample data set.

  1. Verify that your environment is set up properly to use machine learning features. If the Elasticsearch security features are enabled, you need a user that has authority to create and manage data frame analytics jobs. See Setup and security.

    Since we’ll be creating transforms, you also need manage_data_frame_transforms cluster privileges.
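
    If security features are enabled, you can check whether your user has the required privileges with the has privileges API. The request below is a sketch; the manage_ml privilege shown for the data frame analytics part is an assumption, so adjust the list to match your deployment:

    API example
    GET _security/user/_has_privileges
    {
      "cluster": ["manage_data_frame_transforms", "manage_ml"]
    }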

  2. Create a transform that generates an entity-centric index with numeric or boolean data to analyze.

    In this example, we’ll use the eCommerce orders sample data and pivot the data such that we get a new index that contains a sales summary for each customer.

    In particular, create a transform that calculates the sum of the product quantities (products.quantity) and the sum of the taxful prices (products.taxful_price) in all of the orders, grouped by customer (customer_full_name). Also include a value count aggregation, so that we know how many orders (order_id) exist for each customer.

    You can preview the transform before you create it in Kibana:

    Creating a transform in Kibana

    Alternatively, you can use the preview transform API and the create transform API.

    API example
    POST _data_frame/transforms/_preview
    {
      "source": {
        "index": [
          "kibana_sample_data_ecommerce"
        ]
      },
      "pivot": {
        "group_by": {
          "customer_full_name.keyword": {
            "terms": {
              "field": "customer_full_name.keyword"
            }
          }
        },
        "aggregations": {
          "products.quantity.sum": {
            "sum": {
              "field": "products.quantity"
            }
          },
          "products.taxful_price.sum": {
            "sum": {
              "field": "products.taxful_price"
            }
          },
          "order_id.value_count": {
            "value_count": {
              "field": "order_id"
            }
          }
        }
      }
    }
    
    PUT _data_frame/transforms/ecommerce-customer-sales
    {
      "source": {
        "index": [
          "kibana_sample_data_ecommerce"
        ]
      },
      "pivot": {
        "group_by": {
          "customer_full_name.keyword": {
            "terms": {
              "field": "customer_full_name.keyword"
            }
          }
        },
        "aggregations": {
          "products.quantity.sum": {
            "sum": {
              "field": "products.quantity"
            }
          },
          "products.taxful_price.sum": {
            "sum": {
              "field": "products.taxful_price"
            }
          },
          "order_id.value_count": {
            "value_count": {
              "field": "order_id"
            }
          }
        }
      },
      "description": "E-commerce sales by customer",
      "dest": {
        "index": "ecommerce-customer-sales"
      }
    }

    For more details about creating transforms, see Transforming the eCommerce sample data.

  3. Start the transform.

    Even though resource utilization is automatically adjusted based on the cluster load, a transform increases search and indexing load on your cluster while it runs. If the load becomes excessive, however, you can stop the transform.

    You can start, stop, and manage transforms in Kibana. Alternatively, you can use the start transforms API.

    API example
    POST _data_frame/transforms/ecommerce-customer-sales/_start
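
    To check the state and progress of the transform after you start it, you can use the get transform statistics API:

    API example
    GET _data_frame/transforms/ecommerce-customer-sales/_stats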
  4. Create a data frame analytics job to detect outliers in the new entity-centric index.

    There is a wizard for creating data frame analytics jobs on the Machine Learning > Data Frame Analytics page in Kibana:

    Create a data frame analytics job in Kibana

    Alternatively, you can use the create data frame analytics jobs API.

    API example
    PUT _ml/data_frame/analytics/ecommerce
    {
      "source": {
        "index": "ecommerce-customer-sales"
      },
      "dest": {
        "index": "ecommerce-outliers"
      },
      "analysis": {
        "outlier_detection": {
        }
      },
      "analyzed_fields" : {
        "includes" : ["products.quantity.sum","products.taxful_price.sum","order_id.value_count"]
      }
    }
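
    After you create the job, you can retrieve its configuration with the get data frame analytics jobs API to verify that the source, destination, and analyzed fields are what you expect:

    API example
    GET _ml/data_frame/analytics/ecommerce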
  5. Start the data frame analytics job.

    You can start, stop, and manage data frame analytics jobs on the Machine Learning > Data Frame Analytics page in Kibana. Alternatively, you can use the start data frame analytics jobs and stop data frame analytics jobs APIs.

    API example
    POST _ml/data_frame/analytics/ecommerce/_start
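
    Outlier detection runs as a batch analysis and stops automatically when it completes. You can follow its progress with the get data frame analytics jobs statistics API:

    API example
    GET _ml/data_frame/analytics/ecommerce/_stats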
  6. View the results of the outlier detection analysis.

    The data frame analytics job creates an index that contains the original data and outlier scores for each document. The outlier score indicates how different each entity is from other entities.

    In Kibana, you can view the results from the data frame analytics job and sort them on the outlier score:

    View outlier detection results in Kibana

    The ml.outlier_score is a value between 0 and 1. The larger the value, the more likely the entity is to be an outlier.

    In addition to an overall outlier score, each document is annotated with feature influence values for each field. These values add up to 1 and indicate which fields are the most important in deciding whether an entity is an outlier or inlier. For example, the dark shading on the products.taxful_price.sum field for Wagdi Shaw indicates that the sum of the product prices was the most influential feature in determining that Wagdi is an outlier.

    If you want to see the exact feature influence values, you can retrieve them from the index that is associated with your data frame analytics job.

    API example
    GET ecommerce-outliers/_search?q="Wagdi Shaw"

    The search results include the following outlier detection scores:

    ...
    "ml" :{
      "outlier_score" : 0.9653657078742981,
      "feature_influence.products.quantity.sum" : 0.00592468399554491,
      "feature_influence.order_id.value_count" : 0.01975759118795395,
      "feature_influence.products.taxful_price.sum" : 0.974317729473114
    }
    ...
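
    To find the most anomalous customers directly from the destination index, you can sort the search results on the outlier score. The request below is a sketch that returns the five highest-scoring documents:

    API example
    GET ecommerce-outliers/_search
    {
      "size": 5,
      "sort": [
        { "ml.outlier_score": { "order": "desc" } }
      ]
    }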

Now that you’ve found unusual behavior in the sample data set, consider how you might apply these steps to other data sets. If you have data that is already marked up with true outliers, you can determine how well the outlier detection algorithms perform by using the evaluate data frame analytics API. See Evaluating data frame analytics.

If you do not want to keep the transform and the data frame analytics job, you can delete them in Kibana or use the delete transform API and delete data frame analytics job API. When you delete transforms and data frame analytics jobs, the destination indices and Kibana index patterns remain.
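For example, to remove the transform and the data frame analytics job created in this tutorial (the transform must be stopped before it can be deleted):

API example
POST _data_frame/transforms/ecommerce-customer-sales/_stop

DELETE _data_frame/transforms/ecommerce-customer-sales

DELETE _ml/data_frame/analytics/ecommerce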

If you want to see another example of outlier detection in a Jupyter notebook, see the Elastic examples repository on GitHub.