
1.0.0.Beta2 Released

Today we are delighted to announce the release of Elasticsearch 1.0.0.Beta2, the second beta release on the road to 1.0.0 GA. The new features we have planned for 1.0.0 have come together more quickly than we expected, and this beta release is chock full of shiny new toys. Christmas has come early!

We have added:

  • Snapshot / Restore
  • cat API
  • Aggregations

Please download Elasticsearch 1.0.0.Beta2, try it out, break it, figure out what is missing and tell us about it. Our next release will focus on cleaning up inconsistent APIs and usability, plus fixing any bugs that are reported in the new functionality, so your early bug reports are an important part of ensuring that 1.0.0 GA is solid.

WARNING: This is a beta release – it is not production ready, features are not set in stone and may well change in the next version, and once you have made any changes to your data with this release, it will no longer be readable by older versions!

Snapshot / Restore

While it has always been possible to back up a live index on Elasticsearch, the process was a bit finicky. The long-awaited snapshot/restore API makes it easy.

First, define a repository — the place where your backups will live:

curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
    "type": "fs",
    "settings": {
        "location": "/mount/backups/my_backup"
    }
}'

Currently we support the shared filesystem (fs) repository, which needs to be writable by all nodes. In the future we will add support for S3, HDFS, GlusterFS, Google Compute Engine and Microsoft Azure.

With the repository in place, you can tell Elasticsearch to create a snapshot named snapshot_1 with:

curl -XPUT localhost:9200/_snapshot/my_backup/snapshot_1

You can snapshot the whole cluster or just specific indices. The best part is that snapshots are incremental: each one copies only the segments that have changed since the last snapshot was made. This makes the snapshotting process faster and lighter (you can snapshot every 5 minutes if you want to) and uses up less storage in the repository.
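To make "just specific indices" concrete, here is a minimal Python sketch that builds, but does not send, the request for a partial snapshot. It assumes the `indices` body parameter described in the snapshot reference docs. The helper name `build_snapshot_request`, the snapshot name `snapshot_2`, and the index names are made up for illustration, so check the parameter against the docs for your version before relying on it.

```python
import json
import urllib.request


def build_snapshot_request(repo, snapshot, indices=None,
                           host="http://localhost:9200"):
    """Build (but do not send) a PUT request that creates a snapshot.

    Hypothetical helper: `indices` is an optional list of index names.
    With no list, the request has no body and covers the whole cluster.
    """
    url = f"{host}/_snapshot/{repo}/{snapshot}"
    body = None
    if indices:
        # Assumption: the API accepts a comma-separated string of index names.
        body = json.dumps({"indices": ",".join(indices)}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


req = build_snapshot_request("my_backup", "snapshot_2", ["index_1", "index_2"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Against a live cluster you would then call: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch runnable without a cluster and makes the URL and body easy to inspect.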

You can restore the whole cluster, with or without persistent cluster settings, or just individual indices:

curl -XPOST localhost:9200/_snapshot/my_backup/snapshot_1/_restore
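The restore call above takes everything in the snapshot. As a hedged sketch of a narrower restore, the Python below assembles a request body that selects individual indices and skips the persistent cluster settings. It assumes the `indices` and `include_global_state` body parameters from the snapshot reference docs. The helper `restore_call` and the index name are hypothetical, and the flag is always sent explicitly so the sketch does not depend on the server's default.

```python
import json


def restore_call(repo, snapshot, indices=None, include_global_state=True):
    """Return (method, path, body) for a restore request without sending it."""
    path = f"/_snapshot/{repo}/{snapshot}/_restore"
    # Always state the flag explicitly rather than relying on a default.
    body = {"include_global_state": include_global_state}
    if indices:
        # Assumption: a comma-separated string of index names is accepted.
        body["indices"] = ",".join(indices)
    return "POST", path, json.dumps(body, sort_keys=True)


method, path, body = restore_call(
    "my_backup", "snapshot_1", indices=["index_1"], include_global_state=False
)
# Print the equivalent curl command so it can be pasted into a shell.
print(f"curl -X{method} 'localhost:9200{path}' -d '{body}'")
```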

Later on we plan on making cross data-center replication possible by adding the ability to do incremental restores into a read-only index.

You can read more about snapshot/restore in the reference docs.

cat API

JSON is great… for computers. But at 3 AM, when you’re trying to figure out what is happening in your cluster, we humans prefer a simpler text format that is easier to read and to use with command line tools like sort and awk. The new cat API is the system administrator’s friend: it formats the results in a columnar fashion for easy reading and parsing.

For instance, to find out which indices are status yellow:

curl localhost:9200/_cat/indices | grep ^yell
yellow foo          5 1   4 0    17kb    17kb

Column headers are not returned by default, but you can request them by adding the v parameter to the query string:

curl 'localhost:9200/_cat/recovery?v'
index shard   target recovered     % ip            node
wiki1 2     68083830   7865837 11.6% 192.168.56.20 Adam II
wiki2 1      2542400    444175 17.5% 192.168.56.20 Adam II

Above we can see that two shards are recovering after a node failure, and that they are roughly 12% and 18% complete, respectively.
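Because the output is plain columns, it is also easy to consume from a script. The following Python sketch parses the recovery output shown above into one dictionary per row. The helper `parse_cat` is hypothetical, and it assumes that headers were requested with `v`, that no cell is empty, and that only the last column (here the node name "Adam II") may contain spaces.

```python
def parse_cat(text):
    """Turn cat output that includes a header row into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    headers = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split at most len(headers) - 1 times so that the final column
        # keeps any embedded spaces, e.g. the node name "Adam II".
        values = line.strip().split(None, len(headers) - 1)
        rows.append(dict(zip(headers, values)))
    return rows


sample = """\
index shard   target recovered     % ip            node
wiki1 2     68083830   7865837 11.6% 192.168.56.20 Adam II
wiki2 1      2542400    444175 17.5% 192.168.56.20 Adam II
"""

for row in parse_cat(sample):
    print(row["index"], row["shard"], row["%"], row["node"])
```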

There are cat endpoints for many admin APIs, including allocation, indices, nodes, and shards. See the cat reference docs for details.

Aggregations

Aggregations are “facets reborn”. Facets are amazingly powerful — they allow you to summarise vast amounts of data in the context of a user’s query on the fly, without the need for slow batch precalculations. However, the way they are implemented limits them in two important ways:

  • You can’t combine two facet types without writing new code. For instance, to generate statistics on popular terms, we had to write a new terms_stats facet instead of just being able to plug the statistical facet into the terms facet.
  • They are one level deep only. You can ask for “counts of popular hashtags”, or “counts of tweets per day”, but you can’t ask for “counts of popular hashtags per day”.

Aggregations change all this.

There are two types of aggregation (or aggs): bucket aggregators, which divide documents up into separate buckets (e.g. terms, range, histogram, date_histogram, filter, etc.), and metric aggregators, which perform some calculation on the documents in each bucket (e.g. count, avg, stats, etc.).

Buckets can be sub-divided into smaller buckets, and metrics can be calculated for any bucket at any level. In the following example, we run a query looking for all tweets which mention elasticsearch, count the most popular hashtags overall, count the number of tweets per day and calculate the most popular hashtags per day:

GET /tweets/_search
{
    "query": {
        "match": { "tweet": "elasticsearch" }
    },
    "aggs": {
        "popular_hashtags_overall": {
            "terms": { "field": "hashtags" }
        },
        "per_day": {
            "date_histogram": {
                "field":    "date",
                "interval": "day"
            },
            "aggs": {
                "popular_hashtags": {
                    "terms": { "field": "hashtags" }
                }
            }
        }
    }
}

Doing this with facets would have been unthinkable — it would have required a separate query for every day!
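To show what "buckets inside buckets" means, here is a small client-side Python illustration of the per-day part of that request: tweets are first bucketed by day, then hashtags are counted within each day. It is only a conceptual sketch over made-up sample data. It says nothing about how Elasticsearch implements aggregations or how the response is laid out.

```python
from collections import Counter

# Made-up sample documents, already filtered by the query.
tweets = [
    {"date": "2013-12-01", "hashtags": ["elasticsearch", "lucene"]},
    {"date": "2013-12-01", "hashtags": ["elasticsearch"]},
    {"date": "2013-12-02", "hashtags": ["search", "elasticsearch"]},
]


def hashtags_per_day(docs):
    """Bucket documents by day, then count hashtags inside each bucket."""
    buckets = {}
    for doc in docs:
        bucket = buckets.setdefault(
            doc["date"], {"doc_count": 0, "hashtags": Counter()}
        )
        bucket["doc_count"] += 1                    # tweets per day
        bucket["hashtags"].update(doc["hashtags"])  # hashtags per day
    return buckets


for day, bucket in sorted(hashtags_per_day(tweets).items()):
    print(day, bucket["doc_count"], dict(bucket["hashtags"]))
```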

Now that this framework is in place, adding new aggregations is much easier. And as soon as a new aggregation has been implemented, it can be used immediately in combination with any of the other aggregations.
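As a sketch of that composability, the Python below builds a request body that nests a stats metric inside a terms bucket, which is roughly the job the dedicated terms_stats facet used to do. The helper `agg` and the numeric field `retweets` are hypothetical. The clause shapes follow the request format used in the example above.

```python
import json


def agg(kind, sub_aggs=None, **params):
    """Build one aggregation clause; sub_aggs maps names to nested clauses."""
    clause = {kind: params}
    if sub_aggs:
        clause["aggs"] = sub_aggs
    return clause


body = {
    "aggs": {
        "popular_hashtags": agg(
            "terms",
            field="hashtags",
            # "retweets" is an assumed numeric field, used for illustration.
            sub_aggs={"retweet_stats": agg("stats", field="retweets")},
        )
    }
}

print(json.dumps(body, indent=4))
```

Because any clause can be passed as a sub-aggregation of any bucket clause, swapping `stats` for another metric or wrapping the whole thing in a `date_histogram` requires no new helper code.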

Facets are not going away just yet. We want to give you time to migrate from facets to aggregations, so both implementations will be available for the foreseeable future. Also, facets have had many micro-optimizations added over the years to make them perform as well as they do. Aggregations are not yet as fast for the simple case, but we are working hard to squeeze more performance out of them.

You can read more about aggregations in the reference docs.

Still to come

Before the next release, we are going to be looking at tidying up inconsistencies and warts that have crept into our API. We want to make it clean, consistent, and as Do-What-I-Mean as possible. You can take a look at the issues we are considering here. We welcome further discussion.

Please download Elasticsearch 1.0.0.Beta2, try it out, and let us know what you think.