Sampler aggregation

edit

Sampler aggregation

edit

A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.

Example use cases:

Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms

Example:

A query on StackOverflow data for the popular term javascript OR the rarer term kibana will match many documents - most of them missing the word Kibana. To focus the significant_terms aggregation on top-scoring documents that are more likely to match the most interesting parts of our query we use a sample.

resp = client.search(
    index="stackoverflow",
    size="0",
    query={
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    aggs={
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": [
                            "kibana",
                            "javascript"
                        ]
                    }
                }
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'stackoverflow',
  size: 0,
  body: {
    query: {
      query_string: {
        query: 'tags:kibana OR tags:javascript'
      }
    },
    aggregations: {
      sample: {
        sampler: {
          shard_size: 200
        },
        aggregations: {
          keywords: {
            significant_terms: {
              field: 'tags',
              exclude: [
                'kibana',
                'javascript'
              ]
            }
          }
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "stackoverflow",
  size: 0,
  query: {
    query_string: {
      query: "tags:kibana OR tags:javascript",
    },
  },
  aggs: {
    sample: {
      sampler: {
        shard_size: 200,
      },
      aggs: {
        keywords: {
          significant_terms: {
            field: "tags",
            exclude: ["kibana", "javascript"],
          },
        },
      },
    },
  },
});
console.log(response);

POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 200
      },
      "aggs": {
        "keywords": {
          "significant_terms": {
            "field": "tags",
            "exclude": [ "kibana", "javascript" ]
          }
        }
      }
    }
  }
}

Response:

{
  ...
  "aggregations": {
    "sample": {
      "doc_count": 200, 
      "keywords": {
        "doc_count": 200,
        "bg_count": 650,
        "buckets": [
          {
            "key": "elasticsearch",
            "doc_count": 150,
            "score": 1.078125,
            "bg_count": 200
          },
          {
            "key": "logstash",
            "doc_count": 50,
            "score": 0.5625,
            "bg_count": 50
          }
        ]
      }
    }
  }
}

200 documents were sampled in total. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.

Without the sampler aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies less significant terms such as jquery and angular rather than focusing on the more insightful Kibana-related terms.

resp = client.search(
    index="stackoverflow",
    size="0",
    query={
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    aggs={
        "low_quality_keywords": {
            "significant_terms": {
                "field": "tags",
                "size": 3,
                "exclude": [
                    "kibana",
                    "javascript"
                ]
            }
        }
    },
)
print(resp)

response = client.search(
  index: 'stackoverflow',
  size: 0,
  body: {
    query: {
      query_string: {
        query: 'tags:kibana OR tags:javascript'
      }
    },
    aggregations: {
      low_quality_keywords: {
        significant_terms: {
          field: 'tags',
          size: 3,
          exclude: [
            'kibana',
            'javascript'
          ]
        }
      }
    }
  }
)
puts response

const response = await client.search({
  index: "stackoverflow",
  size: 0,
  query: {
    query_string: {
      query: "tags:kibana OR tags:javascript",
    },
  },
  aggs: {
    low_quality_keywords: {
      significant_terms: {
        field: "tags",
        size: 3,
        exclude: ["kibana", "javascript"],
      },
    },
  },
});
console.log(response);

POST /stackoverflow/_search?size=0
{
  "query": {
    "query_string": {
      "query": "tags:kibana OR tags:javascript"
    }
  },
  "aggs": {
    "low_quality_keywords": {
      "significant_terms": {
        "field": "tags",
        "size": 3,
        "exclude": [ "kibana", "javascript" ]
      }
    }
  }
}

Response:

{
  ...
  "aggregations": {
    "low_quality_keywords": {
      "doc_count": 600,
      "bg_count": 650,
      "buckets": [
        {
          "key": "angular",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "jquery",
          "doc_count": 200,
          "score": 0.02777,
          "bg_count": 200
        },
        {
          "key": "logstash",
          "doc_count": 50,
          "score": 0.0069,
          "bg_count": 50
        }
      ]
    }
  }
}

shard_size

edit

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.

Limitations

edit

Cannot be nested under `breadth_first` aggregations

edit

Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores. In this situation an error will be thrown.

« Reverse nested aggregation Significant terms aggregation »

Sampler aggregation