July 12, 2017

A Practical Introduction to Elasticsearch

Editor's Note (August 3, 2021): This post uses deprecated features. Please reference the map custom regions with reverse geocoding documentation for current instructions.

Why this post?

I recently had the pleasure of teaching a Master's class at the University of A Coruña, in the course Information Retrieval and Semantic Web. The focus of this lesson was to give the students a general overview of Elasticsearch so that they could start using it in the course assignments; the attendees ranged from people already familiar with Lucene to people facing Information Retrieval concepts for the first time. Since it was a very late class (it started at 7:30PM), one of the challenges was to keep the attention of the students (or, in other words, to keep them from falling asleep!). There are two basic approaches to keeping attention when teaching: bringing chocolate - which I forgot to do - and making the lesson as practical as possible.

And this is what today’s post provides: we will go through the practical part of that same lesson. The goal is not to learn every single command or request in Elasticsearch (that is why we have documentation); instead, the goal is for you to experience the joy of using Elasticsearch, without prior knowledge, in a 30-60 minute guided tutorial. Just copy-paste every single request to see the results, and try to figure out the solution to the proposed questions.

Who will benefit from this post?

I will show basic features in Elastic to introduce some of the main concepts, occasionally touching on more technical or complex ones, and will link the documentation for further reference (but keep this in mind: for further reference. You can just continue with the examples and leave the documentation for later). If you have not used Elasticsearch before and you want to see it in action - and also to be the director of the action - this is for you. If you are already experienced with Elasticsearch, take a look at the dataset we will be using: when a friend asks you what you can do with Elasticsearch, it is easier to explain it with searches in Shakespeare’s plays!

What will and will not be covered?

We will start adding some documents, searching and removing them. After that, we will use a Shakespeare dataset to provide more insight on searches and aggregations. This is a hands-on “I want to start seeing it working right now” post.

Note that we will not cover anything related to configuration or to best-practices in production deployments: so, use this information to get a bite of what Elasticsearch offers, a starting point to envision how it can fit your needs.

Setup

First of all, you need Elasticsearch. Follow the documentation instructions to download the latest version, install it and start it. Basically, you need to have a recent version of Java, download and install Elasticsearch for your operating system, and finally start it with the default values - bin/elasticsearch. In this lesson we will use the latest version available at the time of writing, 5.5.0.

Next, you need to communicate with Elasticsearch: this is done by issuing HTTP requests against the REST API. Elasticsearch listens on port 9200 by default. To access it, you can use the tool that best fits your expertise: there are command-line tools (like curl for Linux), web-browser REST plugins for Chrome or Firefox, or you can just install Kibana and use the console plugin. Each request consists of an HTTP verb (GET, POST, PUT…), a URL endpoint and an optional body - in most cases, the body is a JSON object.

As an example, and to confirm that Elasticsearch is started, let’s do a GET against the base URL to reach the basic endpoint (no body is needed):

GET localhost:9200

The response should look similar to the following. Since we did not configure anything, the name of our instance will be a random 7-character string:

{
    "name": "t9mGYu5",
    "cluster_name": "elasticsearch",
    "cluster_uuid": "xq-6d4QpSDa-kiNE4Ph-Cg",
    "version": {
        "number": "5.5.0",
        "build_hash": "260387d",
        "build_date": "2017-06-30T23:16:05.735Z",
        "build_snapshot": false,
        "lucene_version": "6.6.0"
    },
    "tagline": "You Know, for Search"
}
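For readers who would rather script these calls than use a REST console, the request anatomy described earlier (an HTTP verb, a URL endpoint and an optional JSON body) could be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names build_request and send are hypothetical, and the base URL assumes an unconfigured local instance on port 9200.

```python
import json
import urllib.request

# Assumption: a default local instance, as started in the Setup section.
BASE_URL = "http://localhost:9200"


def build_request(verb, path="/", body=None):
    """Assemble a request from its three parts: verb, URL endpoint, optional JSON body."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=verb)


def send(request):
    """Send the request and decode the JSON response (requires a running instance)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building a request does not touch the network, so it can be inspected safely.
request = build_request("GET")
print(request.get_method(), request.full_url)

# With Elasticsearch running, the basic endpoint could then be queried with:
#   info = send(build_request("GET"))
#   print(info["version"]["number"])
```

The same two helpers should be enough to replay every request in this post, passing the JSON body as a Python dictionary where one is needed.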

Some basic examples

We already have a clean Elasticsearch instance initialized and running. The first thing we are going to do is to add documents and to retrieve them. Documents in Elasticsearch are represented in JSON format. Also, documents are added to indices, and documents have a type. We are adding now to the index named accounts a document of type person having the id 1; since the index does not exist yet, Elasticsearch will automatically create it.

POST localhost:9200/accounts/person/1 
{
    "name" : "John",
    "lastname" : "Doe",
    "job_description" : "Systems administrator and Linux specialit"
}

The response will return information about the document creation:

{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "created": true
}

Now that the document exists, we can retrieve it:

GET localhost:9200/accounts/person/1

The result will contain metadata and also the full document (shown in the _source field):

{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 1,
    "found": true,
    "_source": {
        "name": "John",
        "lastname": "Doe",
        "job_description": "Systems administrator and Linux specialit"
    }
}

The keen reader will have already noticed that we made a typo in the job description (specialit); let’s correct it by updating the document (_update):

POST localhost:9200/accounts/person/1/_update
{
      "doc":{
          "job_description" : "Systems administrator and Linux specialist"
       }
}

After the operation succeeds, the document will be changed. Let’s retrieve it again and check the response:

{
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 2,
    "found": true,
    "_source": {
        "name": "John",
        "lastname": "Doe",
        "job_description": "Systems administrator and Linux specialist"
    }
}

To prepare for the next operations, let’s add an additional document with id 2:

POST localhost:9200/accounts/person/2
{
    "name" : "John",
    "lastname" : "Smith",
    "job_description" : "Systems administrator"
}

So far, we have retrieved documents by id, but we have not done any searches. When querying using the REST API, we can pass the query in the body of the request or directly in the URL with a specific syntax. In this section we will do searches directly in the URL, in the format /_search?q=something:

GET localhost:9200/_search?q=john

This search will return both documents, since both of them include john:

{
    "took": 58,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "accounts",
                "_type": "person",
                "_id": "2",
                "_score": 0.2876821,
                "_source": {
                    "name": "John",
                    "lastname": "Smith",
                    "job_description": "Systems administrator"
                }
            },
            {
                "_index": "accounts",
                "_type": "person",
                "_id": "1",
                "_score": 0.28582606,
                "_source": {
                    "name": "John",
                    "lastname": "Doe",
                    "job_description": "Systems administrator and Linux specialist"
                }
            }
        ]
    }
}
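The structure of this response can be unpacked programmatically. The following sketch parses a trimmed copy of the response above (only the fields it uses are kept) and pulls out the total and one row per hit. The function name summarize_hits is hypothetical, and the code assumes the 5.x response shape shown in this post, where hits.total is a plain number.

```python
import json

# Trimmed copy of the search response shown above.
response_text = """
{
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {"_id": "2", "_score": 0.2876821,
       "_source": {"name": "John", "lastname": "Smith"}},
      {"_id": "1", "_score": 0.28582606,
       "_source": {"name": "John", "lastname": "Doe"}}
    ]
  }
}
"""


def summarize_hits(response):
    """Return the total number of matches and an (id, score, lastname) row per hit."""
    total = response["hits"]["total"]
    rows = [
        (hit["_id"], hit["_score"], hit["_source"]["lastname"])
        for hit in response["hits"]["hits"]
    ]
    return total, rows


total, rows = summarize_hits(json.loads(response_text))
print("matches:", total)
for doc_id, score, lastname in rows:
    print(doc_id, score, lastname)
```

Hits arrive already sorted by descending _score, which is why the document with id 2 is listed first.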

In this result we can see the matching documents and also some metadata, like the total number of results for the query. Let’s keep doing more searches. Before running them, try to figure out by yourself which documents will be retrieved (the answer follows each command):

GET localhost:9200/_search?q=smith

This search will return only the last document we added, the only one containing smith.

GET localhost:9200/_search?q=job_description:john

This search will not return any document. In this case, we are restricting the search only to the field job_description, which does not contain the term. As an exercise for the reader, try to do:

a search in that field that will return only the document with id 1
a search in that field returning both documents
a search in that field returning only the document with id 2 - you will need a hint: the “q” parameter uses the same syntax as the Query String.

This last example brings a related question: we can do searches in specific fields; is it possible to search only within a specific index? The answer is yes: we can specify the index and type in the URL. Try this:

GET localhost:9200/accounts/person/_search?q=job_description:linux

In addition to searching in one index, we can search in multiple indices at the same time by providing a comma-separated list of index names, and the same can be done for types. There are more options: information about them can be found in Multi-Index, Multi-type. As an exercise for the reader, add documents to a second (different) index and do searches in both indices simultaneously.
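To make the URL layout concrete, here is a small sketch that assembles URI-search addresses from a query string, an optional list of indices and an optional list of types. The helper name search_url and the second index name customers are hypothetical, and the base URL again assumes a default local instance; the code only builds strings and does not contact Elasticsearch.

```python
from urllib.parse import urlencode

# Assumption: a default local instance.
BASE_URL = "http://localhost:9200"


def search_url(query, indices=None, types=None):
    """Build a /_search?q=... URL, optionally restricted to some indices and types."""
    if types and not indices:
        # Without an index, the first path segment would be read as an index name.
        raise ValueError("types can only be given together with indices")
    parts = [BASE_URL]
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    # Keep ':' readable so that field:term queries look like the ones in the text.
    return "/".join(parts) + "?" + urlencode({"q": query}, safe=":")


print(search_url("john"))
print(search_url("job_description:linux", ["accounts"], ["person"]))
print(search_url("john", ["accounts", "customers"]))
```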

To close this section, we will delete a document, and then the entire index. After deleting the document, try to retrieve or find it in searches.

DELETE localhost:9200/accounts/person/1

The response will confirm the deletion:

{
    "found": true,
    "_index": "accounts",
    "_type": "person",
    "_id": "1",
    "_version": 3,
    "result": "deleted",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    }
}

Finally, we can delete the full index.

DELETE localhost:9200/accounts

And this is the end of the first section. Let’s summarize what we did:

Added a document. Implicitly, an index was created (the index did not previously exist).
Retrieved the document.
Updated the document to correct a typo, and checked that it was corrected.
Added a second document.
Searched, including searches that implicitly use all of the fields, and a search focused on only one field.
Proposed several search exercises.
Explained the basics on searching in several indices and types simultaneously.
Proposed searching in several indices simultaneously.
Deleted one document.
Deleted an entire index.

Playing with more interesting data

So far, we played with some fictional data. In this section we will be exploring Shakespeare plays. The first step is to download the file shakespeare.json - available from Kibana: loading sample data. Elasticsearch offers a Bulk API that allows you to perform add, delete, update and create operations in bulk, i.e., many at a time. This file contains data ready to be ingested using this API, prepared to be indexed in an index called shakespeare containing documents of type act, scene and line. The body of a request to the Bulk API consists of one JSON object per line; for addition operations, like the ones in the file, there is one JSON object holding metadata about the add operation, and a second JSON object on the next line containing the document to add:

{"index":{"_index":"shakespeare","_type":"act","_id":0}}
{"line_id":1,"play_name":"Henry IV","speech_number":"","line_number":"","speaker":"","text_entry":"ACT I"}

We will not dig deeper into the Bulk API: if the reader is interested, please refer to the Bulk documentation.
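For readers who want to prepare a file like this from their own data, the two-lines-per-document layout can be generated with a few lines of Python. This is a minimal sketch covering only the index action shown above; the helper name bulk_index_body is hypothetical.

```python
import json


def bulk_index_body(index, doc_type, documents):
    """Build a Bulk API body: one action line followed by one document line per entry.

    `documents` is an iterable of (id, document) pairs.
    """
    lines = []
    for doc_id, document in documents:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The body must end with a newline character.
    return "\n".join(lines) + "\n"


body = bulk_index_body(
    "shakespeare",
    "act",
    [(0, {"line_id": 1, "play_name": "Henry IV", "text_entry": "ACT I"})],
)
print(body, end="")
```

Each JSON object has to stay on a single line, which is why json.dumps is used without any indentation.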

Let’s get all this data into Elasticsearch. Since the body of this request is fairly big (more than 200,000 lines), it is recommended to do this with a tool that can load the body of a request from a file - for instance, using curl:

curl -XPOST "localhost:9200/shakespeare/_bulk?pretty" --data-binary @shakespeare.json

Once the data is loaded, we can start doing some searches. In the previous section we did the searches passing the query in the URL. In this section, we will introduce the Query DSL which specifies a JSON format to be used in the body of search requests to define the queries. Depending on the type of operation, queries can be issued using both GET and POST verbs. Let’s start with the simplest one: getting all of the documents. To do this, we specify in the body a query key, and for the value the match_all query.

GET localhost:9200/shakespeare/_search
{
    "query": {
            "match_all": {}
    }
}

The result will show 10 documents, the default number of results returned; a partial output follows:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 111393,
        "max_score": 1,
        "hits": [
              ...          
            {
                "_index": "shakespeare",
                "_type": "line",
                "_id": "19",
                "_score": 1,
                "_source": {
                    "line_id": 20,
                    "play_name": "Henry IV",
                    "speech_number": 1,
                    "line_number": "1.1.17",
                    "speaker": "KING HENRY IV",
                    "text_entry": "The edge of war, like an ill-sheathed knife,"
                }
            },
            ...

The format for searches is pretty straightforward. A lot of different types of searches are available: Elastic offers direct searches (“Search for this term”, “Search elements in this range”, etc.) and compound queries (“a AND b”, “a OR b”, etc.). The full reference can be found in the Query DSL documentation; we will do just a few examples here to get familiar with how we can use them.

POST localhost:9200/shakespeare/scene/_search/
{
    "query":{
        "match" : {
            "play_name" : "Antony"
        }
    }
}

In the previous query we are searching for all of the scenes (see the URL) in which the play name contains Antony. We can refine this search and keep only the scenes in which Demetrius is also the speaker:

POST localhost:9200/shakespeare/scene/_search/
{
    "query":{
        "bool": {
            "must" : [
                {
                    "match" : {
                        "play_name" : "Antony"
                    }
                },
                {
                    "match" : {
                        "speaker" : "Demetrius"
                    }
                }
            ]
        }
    }
}
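Because Query DSL bodies are plain JSON, they compose naturally from small pieces. The sketch below rebuilds the request body above from two helper functions; the names match and all_of are hypothetical, and only the must clause is covered.

```python
import json


def match(field, text):
    """A leaf query: look for `text` in a single field."""
    return {"match": {field: text}}


def all_of(*clauses):
    """A compound query: every clause must match (bool / must)."""
    return {"query": {"bool": {"must": list(clauses)}}}


body = all_of(match("play_name", "Antony"), match("speaker", "Demetrius"))
print(json.dumps(body, indent=4))
```

The printed JSON is the same object as the body of the request above, so it can be pasted into the console or sent with any HTTP client.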

As a first exercise for the reader, modify the previous query so the search returns not only scenes in which the speaker is Demetrius, but also scenes in which the speaker is Antony - as a hint, check the boolean should clause. As a second exercise, explore the different options that can be used in the Request body when searching - for instance, to paginate by selecting the position in the results at which to start and how many results to retrieve.

So far, we did some queries using the Query DSL. What if, apart from retrieving the contents we are looking for, we could also do some analytics? This is where aggregations come into play. Aggregations allow us to get a deeper insight into the data: for instance, how many different plays exist in our current dataset? How many scenes are there on average per work? Which works have the most scenes?

Before jumping into practical examples, let’s take a step back to when we created the shakespeare index, since continuing without a bit of theory would be a waste. In Elastic, we can create indices defining the datatypes of their fields: numeric fields, keyword fields, text fields… there are a lot of datatypes. The datatypes of the fields in an index are defined via the mappings. In this case, we did not create the index before indexing documents, so Elastic decided the type of each field by itself (it created the mapping of the index). The type text was selected for the textual fields: this type is analyzed, which is what allowed us to find the play_name Antony and Cleopatra by simply searching for Antony. By default, we cannot run aggregations on analyzed fields. How are we going to show aggregations, then, if the fields are not valid for them? When Elastic decided the type of each field, it also added a non-analyzed version of each text field (called keyword) in case we wanted to do aggregations, sorting or scripting: we can just use play_name.keyword in the aggregations. Inspecting the current mappings is left as an exercise for the reader.
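The difference between an analyzed text field and its keyword counterpart can be illustrated with a toy model. The analyze function below is a deliberately rough stand-in for the real analysis chain (it only lowercases and splits on non-alphanumeric characters), and all three function names are hypothetical; the point is simply that one field is matched term by term while the other is compared as a whole value.

```python
import re


def analyze(text):
    """Rough stand-in for analysis: lowercase, then split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def text_field_matches(stored, query):
    """Analyzed field: a hit if any query term is among the stored terms."""
    stored_terms = analyze(stored)
    return any(term in stored_terms for term in analyze(query))


def keyword_field_matches(stored, query):
    """Keyword field: the whole value must be identical."""
    return stored == query


play = "Antony and Cleopatra"
print(analyze(play))
print(text_field_matches(play, "Antony"))
print(keyword_field_matches(play, "Antony"))
print(keyword_field_matches(play, "Antony and Cleopatra"))
```

Keeping the whole value intact is what makes the keyword version suitable for grouping: an aggregation over it sees Antony and Cleopatra as a single value rather than three separate terms.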

And after this relatively small and relatively theoretic lesson, let’s jump back to the keyboard and to aggregations! We can start inspecting our data by checking how many different plays we have:

GET localhost:9200/shakespeare/_search
{
    "size":0,
    "aggs" : {
        "Total plays" : {
            "cardinality" : {
                "field" : "play_name.keyword"
            }
        }
    }
}

Note that since we are not interested in the documents themselves, we asked for 0 results. Also, since we want to explore the entire index, we do not include a query section: the aggregations are computed over all of the documents that match the query, which defaults to match_all in this case. Finally, we use a cardinality aggregation, which tells us how many unique values we have for the play_name field.

{
    "took": 67,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 111393,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "Total plays": {
            "value": 36
        }
    }
}

Now, let’s list the plays that appear most often in our dataset:

GET localhost:9200/shakespeare/_search
{
    "size":0,
    "aggs" : {
        "Popular plays" : {
            "terms" : {
                "field" : "play_name.keyword"
            }
        }
    }
}

The result is:

{
    "took": 35,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 111393,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "Popular plays": {
            "doc_count_error_upper_bound": 2763,
            "sum_other_doc_count": 73249,
            "buckets": [
                {
                    "key": "Hamlet",
                    "doc_count": 4244
                },
                {
                    "key": "Coriolanus",
                    "doc_count": 3992
                },
                {
                    "key": "Cymbeline",
                    "doc_count": 3958
                },
                {
                    "key": "Richard III",
                    "doc_count": 3941
                },
                {
                    "key": "Antony and Cleopatra",
                    "doc_count": 3862
                },
                {
                    "key": "King Lear",
                    "doc_count": 3766
                },
                {
                    "key": "Othello",
                    "doc_count": 3762
                },
                {
                    "key": "Troilus and Cressida",
                    "doc_count": 3711
                },
                {
                    "key": "A Winters Tale",
                    "doc_count": 3489
                },
                {
                    "key": "Henry VIII",
                    "doc_count": 3419
                }
            ]
        }
    }
}

We can see the 10 most popular values of play_name. It is up to the reader to dig into the documentation to figure out how to show more or fewer values in the aggregation.
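To build intuition for what these two aggregations compute, here is a small in-memory imitation in Python. It is only a conceptual sketch over a made-up list of documents, not how Elasticsearch works internally: the real cardinality aggregation returns an approximate count, and the real terms aggregation merges per-shard results (hence the doc_count_error_upper_bound in the response). The function and parameter names are hypothetical.

```python
from collections import Counter


def cardinality(documents, field):
    """Count the distinct values of a field (exact here, approximate in Elasticsearch)."""
    return len({doc[field] for doc in documents if field in doc})


def terms(documents, field, top_n=10):
    """Group documents by field value and keep the most frequent groups."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    top = counts.most_common(top_n)
    return {
        # Documents that fell outside the returned buckets.
        "sum_other_doc_count": sum(counts.values()) - sum(n for _, n in top),
        "buckets": [{"key": key, "doc_count": n} for key, n in top],
    }


# Made-up sample data, for illustration only.
docs = (
    [{"play_name": "Hamlet"}] * 3
    + [{"play_name": "Othello"}] * 2
    + [{"play_name": "King Lear"}]
)

print(cardinality(docs, "play_name"))
print(terms(docs, "play_name", top_n=2))
```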

If you made it this far, you can surely figure out the next step: combining aggregations. We could be interested in knowing how many scenes, acts and lines we have in the index; but also, we could be interested in the same value per play. We can do this by nesting aggregations inside of aggregations:

GET localhost:9200/shakespeare/_search
{
    "size":0,
    "aggs" : {
        "Total plays" : {
            "terms" : {
                "field" : "play_name.keyword"
            },
            "aggs" : {
                "Per type" : {
                    "terms" : {
                        "field" : "_type"
                     }
                }
            }
        }
    }
}

And a part of the response:

    "aggregations": {
        "Total plays": {
            "doc_count_error_upper_bound": 2763,
            "sum_other_doc_count": 73249,
            "buckets": [
                {
                    "key": "Hamlet",
                    "doc_count": 4244,
                    "Per type": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {
                                "key": "line",
                                "doc_count": 4219
                            },
                            {
                                "key": "scene",
                                "doc_count": 20
                            },
                            {
                                "key": "act",
                                "doc_count": 5
                            }
                        ]
                    }
                },
                ...
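Nested aggregation responses like this one are easy to flatten into a lookup table. The sketch below works on a trimmed copy of the aggregations section above (only the Hamlet bucket and only the fields it uses); the function name counts_per_play is hypothetical, and the aggregation names are the ones chosen in the request.

```python
import json

# Trimmed copy of the "aggregations" section shown above.
aggregations_text = """
{
  "Total plays": {
    "buckets": [
      {
        "key": "Hamlet",
        "doc_count": 4244,
        "Per type": {
          "buckets": [
            {"key": "line", "doc_count": 4219},
            {"key": "scene", "doc_count": 20},
            {"key": "act", "doc_count": 5}
          ]
        }
      }
    ]
  }
}
"""


def counts_per_play(aggregations):
    """Turn the nested buckets into {play: {type: count}}."""
    table = {}
    for play in aggregations["Total plays"]["buckets"]:
        table[play["key"]] = {
            bucket["key"]: bucket["doc_count"]
            for bucket in play["Per type"]["buckets"]
        }
    return table


table = counts_per_play(json.loads(aggregations_text))
print(table)
```

As a sanity check, the inner counts for Hamlet (4219 lines, 20 scenes, 5 acts) add up to the doc_count of the outer bucket, 4244.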

There are plenty of different aggregations in Elasticsearch: aggregations using the result of other aggregations, metrics aggregations like cardinality, bucket aggregations like terms… It is up to the reader to take a look at the list and decide which aggregation fits a specific use case that you may already have in mind! Maybe a Significant terms aggregation to find the uncommonly common?

And this is the end of the second section. Let’s summarize what we did:

Used the Bulk API to add Shakespeare plays.
Ran simple searches, inspecting the generic format for queries in the Query DSL.
Ran a leaf search, searching for text in a field.
Ran a compound search, combining 2 text searches.
Proposed adding a second compound search.
Proposed testing the different options in the Request body.
Introduced the concept of aggregations, along with a brief review of Mappings and Field types.
Calculated how many plays we have in our dataset.
Retrieved the plays that appear most often in the dataset.
Combined several aggregations to see how many acts, scenes and lines each of the 10 most frequent plays had.
Proposed exploring some more of the aggregations in Elastic.

Some extra advice

During these exercises we played a bit with the concept of types in Elasticsearch. In the end, a type is just an extra internal field. It is worth noting that from version 6 we will only allow indices to be created with a single type, and from version 7 it is expected that types will be removed. More information can be found in this blog post.

Conclusion

In this post, we used some examples to give a bit of a taste of Elasticsearch.

There is much (much!) more in Elasticsearch and in the Elastic stack to explore than what was shown in this brief article. An item worth mentioning before we finish is Relevancy. Elasticsearch is not only about Does this document satisfy my search requirement?, but also covers How well does this document satisfy my search requirement?, offering up first the search results which are most relevant to the query. The documentation is extensive and full of examples.

Before implementing any new custom feature, it is advisable to check the documentation to see whether we have already implemented it, so that you can easily leverage it in your project. It’s quite possible that a feature or idea that you think would be useful is already available, since our development roadmap is heavily influenced by what our users and developers let us know they want!

If you need authentication/access_control/encryption/audit, these features are already available in Security. If you need to monitor the cluster, this is available in Monitoring. If you need to create fields on demand in your results, we already allow it via Script fields. If you need to create alerts by e-mail/Slack/Hipchat/etc, this is available via Alerting. If you need to visualize the data and aggregations in graphs, we already offer a rich environment to do this - Kibana. If you need to index data from databases, log files, management queues, or almost any imaginable source, this is offered via Logstash and Beats. If you need, if you need, if you need… Check if you can already have it!