Create an Elasticsearch Index

Two very important concepts in Elasticsearch are documents and indexes.

A document is a collection of fields with their associated values. To work with Elasticsearch you have to organize your data into documents, and then add all your documents to an index. You can think of an index as a collection of documents that is stored in a highly optimized format designed to perform efficient searches.

If you have worked with other databases, you may know that many databases require a schema definition, which is essentially a description of all the fields that you want to store and their types. An Elasticsearch index can be configured with a schema if desired, but it can also automatically derive the schema from the data itself. In this section you are going to let Elasticsearch figure out the schema on its own, which works quite well for simple data types such as text, numbers and dates. Later, after you are introduced to more complex data types, you will learn how to provide explicit schema definitions.

Create the Index

This is how you create an Elasticsearch index using the Python client library:

self.es.indices.create(index='my_documents')

In this example, self.es is an instance of the Elasticsearch class, which in this tutorial is stored in the Search class in search.py. An Elasticsearch deployment can be used to store multiple indexes, each identified by a name such as my_documents in the example above.

Indexes can also be deleted:

self.es.indices.delete(index='my_documents')

If you attempt to create an index with a name that is already assigned to an existing index, you will get an error. Sometimes it is useful to create an index while automatically deleting any previous index with the same name. This is especially useful while developing an application, because you will likely need to regenerate an index several times.

Let's add a create_index() helper method in search.py. Open this file in your code editor, and add the following code at the bottom, leaving the existing contents as they are:

class Search:
    # ...

    def create_index(self):
        self.es.indices.delete(index='my_documents', ignore_unavailable=True)
        self.es.indices.create(index='my_documents')

The create_index() method first deletes the index named my_documents, if it exists. The ignore_unavailable=True option prevents this call from failing when the index name isn't found. The following line in the method creates a brand new index with that same name.

The example application featured in this tutorial needs a single Elasticsearch index, and for that reason it hardcodes the index name as my_documents. For more complex applications that use multiple indexes, you may consider accepting the index name as an argument.
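For an application with several indexes, the helper could take the index name as a parameter. The following is only a minimal sketch and not part of the tutorial's code: the MultiIndexSearch class name and the index_name parameter are hypothetical, and the client is passed to the constructor just to keep the example self-contained.

```python
class MultiIndexSearch:
    """Hypothetical variant of the Search class that works with any index."""

    def __init__(self, es):
        # es is assumed to be an instance of the Elasticsearch client class
        self.es = es

    def create_index(self, index_name='my_documents'):
        # Delete any previous index with this name, then create a fresh one
        self.es.indices.delete(index=index_name, ignore_unavailable=True)
        self.es.indices.create(index=index_name)
```

The default value keeps existing calls to create_index() working unchanged, while other callers can pass a different name.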

Add Documents to the Index

In the Elasticsearch client library for Python, a document is represented as a dictionary of key/value fields. Fields that have a string value are automatically indexed for full-text and keyword search, but in addition to strings you can use other field types such as numbers, dates and booleans, which are also indexed for efficient operations such as filtering. You can also build complex data structures in which a field is set to a list or a dictionary with sub-items.
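Once an index contains documents, the field types that Elasticsearch derived can be inspected through the mapping returned by the client's indices.get_mapping() method. The sketch below condenses such a response into a dictionary of field names and types. The summarize_mapping() helper is a hypothetical name, and the code assumes the usual response layout, in which the mappings are nested under the index name.

```python
def summarize_mapping(mapping_response, index_name='my_documents'):
    """Map each top-level field to the list of types it is indexed as."""
    properties = mapping_response[index_name]['mappings'].get('properties', {})
    summary = {}
    for field, spec in properties.items():
        # Fields holding sub-items carry no 'type' key; report them as objects
        types = [spec.get('type', 'object')]
        # String fields usually get an extra 'keyword' sub-field for exact matches
        types.extend(sub['type'] for sub in spec.get('fields', {}).values())
        summary[field] = types
    return summary
```

With a client instance available, a call such as summarize_mapping(es.indices.get_mapping(index='my_documents')) would typically report string fields as both text and keyword, which reflects the full-text and keyword indexing described above.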

To insert a document into an index, the index() method of the Elasticsearch client is used. For example:

document = {
    'title': 'Work From Home Policy',
    'contents': 'The purpose of this full-time work-from-home policy is...',
    'created_on': '2023-11-02',
}
response = es.index(index='my_documents', body=document)
print(response['_id'])

The return value from the index() method is the response returned by the Elasticsearch service. The most important piece of information returned in this response is an item with a key name of _id, representing the unique identifier that was assigned to the document when it was inserted into the index. The identifier can be used to retrieve, delete or update the document.
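As an illustration of how the identifier can be used, the sketch below retrieves a document with the client's get() method. The fetch_document() helper and its parameter names are hypothetical and only serve this example. The es argument is assumed to be an Elasticsearch client instance.

```python
def fetch_document(es, document_id, index_name='my_documents'):
    """Return the stored fields of a document, looked up by its identifier."""
    # es is assumed to be an instance of the Elasticsearch client class
    response = es.get(index=index_name, id=document_id)
    # The fields of the original document are returned under '_source'
    return response['_source']
```

Passing the value of response['_id'] obtained when the document was inserted should return the dictionary that was originally indexed.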

Now that you know how to insert a document, let's continue building a library of useful helpers in search.py, with a new insert_document() method for the Search class. Add this method at the bottom of search.py:

class Search:
    # ...

    def insert_document(self, document):
        return self.es.index(index='my_documents', body=document)

The method accepts a document from the caller and inserts it into the my_documents index, returning the response from the service.

NOTE: These operations are not covered in this tutorial, but the Elasticsearch client can also modify and delete documents. See the Elasticsearch class reference in the Python library documentation to learn about all the operations that are available.

Ingesting Documents from a JSON File

When setting up a new Elasticsearch index, you are likely going to need to import a large number of documents. For this tutorial, the starter project includes a data.json file with some data in JSON format. In this section you will learn how to import all the documents contained in this file into the index.

The documents included in data.json have the following fields:

  • name: the document title
  • url: a URL to the document hosted on an external site
  • summary: a short summary of the contents of the document
  • content: the body of the document
  • created_on: creation date
  • updated_at: update date (could be missing if the document was never updated)
  • category: the document's category, which can be github, sharepoint or teams
  • rolePermissions: a list of role permissions

At this point you are encouraged to open data.json in your editor to familiarize yourself with the data that you are going to work with.

In essence, importing a large number of documents is no different than importing one document inside a for-loop. To import the entire contents of the data.json file, you could do something like this:

import json
from search import Search
es = Search()
with open('data.json', 'rt') as f:
    documents = json.loads(f.read())
for document in documents:
    es.insert_document(document)

While this approach works, it does not scale well. If you had to insert a very large number of documents, you would need to make an equally large number of calls to the Elasticsearch service. Unfortunately there is a performance cost associated with each API call, and the service also has rate limits in place that prevent a large number of calls from being made very quickly. For these reasons, it is best to use the bulk insertion feature of the Elasticsearch service, which allows several operations to be communicated to the service in a single API call.

The insert_documents() method shown below, which you should add at the bottom of search.py, uses the bulk() method to insert all the documents in a single call:

class Search:
    # ...

    def insert_documents(self, documents):
        operations = []
        for document in documents:
            operations.append({'index': {'_index': 'my_documents'}})
            operations.append(document)
        return self.es.bulk(operations=operations)

The method accepts a list of documents. Instead of adding each document separately, it assembles a single list called operations, and then passes the list to the bulk() method. For each document, two entries are added to the operations list:

  • A description of what operation to perform, set to index, with the name of the index given as an argument.
  • The actual data of the document

When processing a bulk request, the Elasticsearch service walks the operations list from the start and performs the operations that were requested.
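To make the layout of the operations list concrete, here is a small standalone sketch that builds the list for two made-up documents without contacting Elasticsearch. The build_operations() function is a hypothetical helper that mirrors the loop in insert_documents().

```python
def build_operations(documents, index_name='my_documents'):
    """Build the alternating action/document list expected by bulk()."""
    operations = []
    for document in documents:
        # First entry: the operation to perform and the target index
        operations.append({'index': {'_index': index_name}})
        # Second entry: the data of the document itself
        operations.append(document)
    return operations


documents = [
    {'name': 'First document'},
    {'name': 'Second document'},
]
for entry in build_operations(documents):
    print(entry)

# Output:
# {'index': {'_index': 'my_documents'}}
# {'name': 'First document'}
# {'index': {'_index': 'my_documents'}}
# {'name': 'Second document'}
```

The list always has twice as many entries as there are documents, with each operation description immediately followed by the document it applies to.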

Regenerating the Index

While you work on this tutorial you will need to regenerate the index a few times. To streamline this operation, add a reindex() method to search.py:

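import json  # at the top of search.py, if it is not already imported

class Search:
    # ...

    def reindex(self):
        # Start from an empty index, then load every document from data.json
        self.create_index()
        with open('data.json', 'rt') as f:
            documents = json.loads(f.read())
        return self.insert_documents(documents)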

This method combines the create_index() and insert_documents() methods created earlier, so that with a single call the old index can be destroyed (if it exists) and a new index built and repopulated.

NOTE: When indexing a very large number of documents, it is best to divide the list of documents into smaller sets and import each set separately.
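The batching suggested in the note could be sketched as follows. This is an illustration rather than part of the tutorial's code: insert_in_batches() is a hypothetical helper, and the default batch size of 500 is an arbitrary choice that would need tuning for real data.

```python
def insert_in_batches(documents, insert_batch, batch_size=500):
    """Split documents into batches and pass each one to insert_batch.

    insert_batch is any callable that accepts a list of documents, such as
    the insert_documents() method of the Search class.
    """
    responses = []
    for start in range(0, len(documents), batch_size):
        # Slicing never runs past the end, so the last batch may be shorter
        batch = documents[start:start + batch_size]
        responses.append(insert_batch(batch))
    return responses
```

With es being a Search instance, a call such as insert_in_batches(documents, es.insert_documents) would issue one bulk request per batch and return the list of responses.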

To make this method easier to invoke, let's expose it through the flask command. Open app.py in your code editor and add the following function at the bottom:

@app.cli.command()
def reindex():
    """Regenerate the Elasticsearch index."""
    response = es.reindex()
    print(f'Index with {len(response["items"])} documents created '
          f'in {response["took"]} milliseconds.')

The @app.cli.command() decorator tells the Flask framework to register this function as a custom command, which will be available as flask reindex. The name of the command is taken from the function's name, and the docstring is included here because Flask uses it in the --help documentation.

The response from the reindex() method, which is the response from the bulk() method of the Elasticsearch client, contains some items that are useful for building a status message. In particular, response['took'] is the duration of the call in milliseconds, and response['items'] is a list with the individual result of each operation. The results are not used directly here, but the length of the list gives the number of documents that were submitted, which matches the number inserted as long as no operation failed.
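If a more precise status message is wanted, the individual results can be examined. The sketch below counts successes and failures, assuming the usual bulk response layout in which each item has a single key naming the operation and failed operations include an error entry. The count_bulk_results() name is hypothetical.

```python
def count_bulk_results(response):
    """Return a (succeeded, failed) tuple for a bulk() response."""
    succeeded = 0
    failed = 0
    for item in response['items']:
        # Each item has a single key naming the operation, such as 'index'
        result = next(iter(item.values()))
        if 'error' in result:
            failed += 1
        else:
            succeeded += 1
    return succeeded, failed
```

The two counts always add up to len(response['items']), so the status message could report failures separately when there are any.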

See how that looks by running flask --help from a terminal session, making sure the Python virtual environment is activated (if your terminal session is still running the Flask application, you can open a second terminal window). Towards the end of the help screen you should see the reindex option included as an available command along with other options provided by the Flask framework:

Commands:
  reindex  Regenerate the Elasticsearch index.
  routes   Show the routes for the app.
  run      Run a development server.
  shell    Run a shell in the app context.

Now when you want to generate a clean index, all you need to do is run flask reindex.
