December 20, 2013

Elasticsearch's New Aggregations

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Elasticsearch's faceting feature has made it extremely popular not just for realtime search, but also for analytics. With its new aggregations framework, it'll take you even further.

Introduction

Facets come in many forms and can be processed and visualized in many ways. Superfast faceting and filtering underpins usage of search engines for analytical purposes. We are not necessarily interested in finding the top n documents, but instead in what we can learn from filtering sets of results and seeing how e.g. histograms and charts change. The massively popular combination of Elasticsearch and Kibana for data crunching and visualization is an example of this kind of usage.

While facets are great and people are finding all sorts of creative ways of using them, they have some limitations. To deal with these limitations, Elasticsearch has engineered the new aggregations framework, which brings a whole new level of awesome.

Aggregations

Aggregations can retain information about which documents go into which bucket. A “bucketing” aggregator takes a set of documents and outputs one or more buckets. These buckets also have their own sets of documents. Therefore, one aggregations’ output can be used as another one’s input! The resulting possibilities are vast. You can now compose aggregations, creating trees of aggregators. Before we jump into building vast forrests of aggregation trees we’ll start with something simple.

Let’s start by combining aggregators to create something like the existing terms_stats facet. The terms stats facet is equivalent to a combination of terms and stats aggregators. It makes it possible to get statistics about one field per value of another, e.g. price statistics per tag. There is no corresponding terms_stats-aggregation. It’s not necessary, as we can now achieve the same goal by combining terms and stats aggregations. Our query will first perform a terms-aggregation and then perform a stats-aggregation. For every bucket the terms-aggregation produces, a stats-aggregation is run on the documents in that bucket.

Facets vs. aggregations

Buckets vs. Metrics

The stats-aggregation is an example of what’s called a metric aggregator, and sometimes also a “calculator”. Example metric aggregators are min, max, sum, avg, value_count, stats, and extended_stats. These take a set of documents as input and crunch some numbers for the provided field. The output of metric aggregators does not include any information linked to individual documents, just the statistical data. i.e. metric aggregators can calculate that the minimum price is 10, but will not tell you which documents have that value.

stats is a metric aggregator

stats is a metric aggregator

The other kind of aggregator is a bucketing aggregator. These aggregators produce buckets, where a bucket has an associated value and a set of documents. For example, the terms-aggregator produces a bucket per term for the field it’s aggregated on. A document can end up in multiple buckets if the document has multiple values for the field being aggregated on. If a bucket aggregator has one or more “downstream” (i.e. child) aggregators, these are run on every bucket produced.

terms is an aggregator that produces buckets

terms is an aggregator that produces buckets

Bucket Ordering

By default, the buckets for a terms aggregator are ordered by the number of documents within them, high to low. You can make the order depend on the value of a metrics sub-aggregator as well. For example, you may want to produce a list of products and order them by the revenue generated, and not just the sales count. Similarily, the histogram-aggregation orders buckets by the implied key by default, but this can be changed as well.

Ordering the buckets by computed metrics

Real World Example

Honza Král presents an excellent example of a more involved aggregation in his PyCon PL 2013-presentation “Explore Your Data”. Using Stackoverflow data as an example, he shows how to structure a query that can answer the question: “What’s the number of questions per tag? And per tag, what’s the average number of comments? Who’s commenting the most, and what’s the highest score of a comment per author?”

Summary

Aggregations bring a lot of new possibilities to the table. They will undoubtedly further strengthen Elasticsearch’s position as an analytics engine. In a future post we will look more into how they are implemented under the covers, in order to get a better understanding of the performance and memory complexity involved.

While the facet framework will continue to be supported for a while, it will be deprecated as of 1.0. Thus, you should plan on moving to the aggregations framework in the not too distant future.

컨텍스트 엔지니어링

벡터 데이터베이스

Search AI 기반 애플리케이션

로그

위협 보호

워크플로우

Elasticsearch

Kibana(Discover, 대시보드)

Elastic Agent Builder

AutoOps

파이프 쿼리 언어

Jina AI 검색 모델

Elastic Cloud Serverless

Elastic Cloud Hosted

자체 관리형 Elasticsearch

전자 상거래 검색

고객 지원 검색

검색 기반 앱

로그 분석

인프라 모니터링

디지털 경험 모니터링

앱 성능 모니터링

AIOps

LLM 통합 가시성

차세대 SIEM

보안 워크플로우

XDR 및 엔드포인트 보안

보안을 위한 AI

데이터 가치 10배 향상

클라우드 서비스 제공자

Elastic AI 에코시스템

Search AI 파트너 프로그램

AV-Comparatives

Forrester Wave™ 리더

Gartner Magic Quadrant 리더

IDC MarketScape 리더

검색

보안

통합 가시성

시작하기

데모 갤러리

다운로드

통합

설명서

Elastic Search Labs

Elastic Security Labs

Elastic Observability Labs

블로그

커뮤니티

이벤트

웨비나

토론

교육

지원

컨설팅