Store Embeddings in Elasticsearch

Elasticsearch provides native support for storing and retrieving vectors, which makes it well suited as a database for working with embeddings.

Field Types

In the Full-Text Search chapter of this tutorial you learned how to create an index with several fields. At that time it was mentioned that Elasticsearch can, for the most part, automatically determine the best type to use for each field based on the data itself. Even though Elasticsearch 8.11 is able to automatically map some vector types, in this chapter you will define this type explicitly as an opportunity to learn more about type mappings in Elasticsearch.

Retrieving Type Mappings

The types associated with each field in an index are determined in a process called mapping, which can be dynamic or explicit. The mappings that were created in the Full-Text Search portion of this tutorial were all dynamically generated by Elasticsearch.

The Elasticsearch client offers a get_mapping method, which returns the type mappings that are in effect for a given index. If you want to explore these mappings on your own, start a Python shell and enter the following code:

from app import es
es.es.indices.get_mapping(index='my_documents')

The response from the get_mapping() method is a dictionary with information about every field in the index. For your convenience, below is a nicely formatted structure of this information for the my_documents index created in the Full-Text Search section of the tutorial:

{
  "my_documents": {
    "mappings": {
      "properties": {
        "category": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "content": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "created_on": {
          "type": "date"
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "rolePermissions": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "summary": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "updated_at": {
          "type": "date"
        },
        "url": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

From this you can see that the created_on and updated_at fields were automatically typed as date, while every other field was typed as text. When deciding on a type, Elasticsearch first checks the type of the data, which helps it assign numeric, boolean and object types to fields. When the field data is a string, it also checks whether the data matches a date pattern. Pattern-based detection on strings can also be enabled for numbers if desired.
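To see the field types at a glance, the mapping response can be reduced to a simple field-to-type table. The following is a minimal sketch that works on a plain dictionary shaped like the response shown above, trimmed here to three fields. The field_types() helper is a hypothetical name introduced for illustration; it is not part of the Elasticsearch client.

```python
def field_types(response, index):
    """Return a {field_name: type} dictionary for one index."""
    properties = response[index]['mappings']['properties']
    # Object fields carry a nested 'properties' key instead of 'type',
    # so fall back to 'object' when no type is present.
    return {name: spec.get('type', 'object') for name, spec in properties.items()}


# Trimmed copy of the mapping shown above, used as sample input.
sample_response = {
    'my_documents': {
        'mappings': {
            'properties': {
                'category': {
                    'type': 'text',
                    'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}},
                },
                'created_on': {'type': 'date'},
                'summary': {
                    'type': 'text',
                    'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}},
                },
            }
        }
    }
}

types = field_types(sample_response, 'my_documents')
for name, field_type in types.items():
    print(f'{name}: {field_type}')
```

When working against a live index, the same helper should accept the body of the get_mapping() response, assuming the index name matches.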

The text fields have a fields definition with a keyword entry. This is called a sub-field, an alternative or secondary type that is available to use when appropriate. In Elasticsearch, dynamically typed text fields are given a keyword sub-field. You have already used the category.keyword sub-field to perform an exact search for a given category. To avoid having the sub-field added, an explicit mapping of text or keyword can be given, and then this will be the main and only type.
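As an illustration of that last point, here is a minimal sketch of an explicit mapping that types category as keyword only. This is a hypothetical variation and not a change that the tutorial application makes. The category value used in the query is a placeholder.

```python
# Hypothetical explicit mapping: 'category' is typed as keyword only,
# so Elasticsearch does not generate a 'category.keyword' sub-field for it.
explicit_mappings = {
    'properties': {
        'category': {'type': 'keyword'},
    }
}

# With this mapping an exact match targets the field itself
# instead of the sub-field. 'example-category' is a placeholder value.
exact_match_query = {
    'term': {
        'category': {'value': 'example-category'},
    }
}

print(explicit_mappings)
print(exact_match_query)
```

The trade-off is that a keyword-only field no longer supports full-text matching on individual words, so this choice only fits fields that are always searched by their exact value.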

Adding a Vector Field to the Index

Let's add a new field to the index where an embedding for each document will be stored.

The structure of an explicit mapping matches the mappings key of the response returned by the get_mapping() method of the Elasticsearch client. Only the fields that require an explicit type have to be listed; any fields left out of the mapping continue to be typed dynamically as before.

Below you can see a new version of the create_index() method of the Search class, adding an explicitly typed field named embedding. Replace this method in search.py:

class Search:
    # ...

    def create_index(self):
        self.es.indices.delete(index='my_documents', ignore_unavailable=True)
        self.es.indices.create(index='my_documents', mappings={
            'properties': {
                'embedding': {
                    'type': 'dense_vector',
                }
            }
        })

As you can see, the embedding field is given a type of dense_vector, which is the appropriate type when storing embeddings. You will later learn about another type of vector, the sparse_vector, which is useful in other types of semantic search applications.

The dense_vector type accepts a few parameters, all of which are optional.

dims: the size of the vectors that will be stored. Since version 8.11 the dimensions are automatically assigned when the first document is inserted.
index: must be set to True to indicate that the vectors should be indexed for searching. This is the default.
similarity: the distance function to use when comparing vectors. The two most common ones are dot_product and cosine. The dot product is more efficient, but it requires vectors to be normalized to unit length. The default is cosine.
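To make these parameters concrete, the sketch below spells out all three in a mapping and shows why dot_product needs normalized vectors. It is an illustration under stated assumptions, not a change to the tutorial's create_index() method: the dims value of 384 is the output size of the all-MiniLM-L6-v2 model, and the normalize(), dot() and cosine() helpers are hypothetical names written in plain Python.

```python
import math

# Hypothetical mapping with every parameter given explicitly.
# 384 is the embedding size produced by all-MiniLM-L6-v2.
embedding_mappings = {
    'properties': {
        'embedding': {
            'type': 'dense_vector',
            'dims': 384,
            'index': True,
            'similarity': 'dot_product',
        }
    }
}


def normalize(vector):
    """Scale a vector to unit length, as dot_product similarity expects."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError('cannot normalize a zero vector')
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 2.0]

# For unit-length vectors the dot product and the cosine similarity agree,
# which is why the cheaper dot product can stand in for cosine.
print(dot(normalize(a), normalize(b)))
print(cosine(a, b))
```

If you go this route, the embeddings have to be normalized before they are stored. The encode() method of SentenceTransformers has a normalize_embeddings argument that should take care of this, but keeping the default cosine similarity avoids the question entirely.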

Adding Embeddings to Documents

In the previous section you learned how to generate embeddings using the SentenceTransformers framework and the all-MiniLM-L6-v2 model. Now it is time to integrate the model into the application.

First of all, the model can be instantiated in the Search class constructor:

# ...
from sentence_transformers import SentenceTransformer

# ...

class Search:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.es = Elasticsearch(cloud_id=os.environ['ELASTIC_CLOUD_ID'],
                                api_key=os.environ['ELASTIC_API_KEY'])
        client_info = self.es.info()
        print('Connected to Elasticsearch!')
        pprint(client_info.body)
      
    # ...

As you recall from the full-text search portion of this tutorial, the Search class has insert_document() and insert_documents() methods, to insert single and multiple documents into the index respectively. These two methods now need to generate the corresponding embeddings that go with each document.

The next code block shows new versions of these two methods, along with a new get_embedding() helper method that returns an embedding.

class Search:
    # ...

    def get_embedding(self, text):
        return self.model.encode(text)

    def insert_document(self, document):
        return self.es.index(index='my_documents', document={
            **document,
            'embedding': self.get_embedding(document['summary']),
        })

    def insert_documents(self, documents):
        operations = []
        for document in documents:
            operations.append({'index': {'_index': 'my_documents'}})
            operations.append({
                **document,
                'embedding': self.get_embedding(document['summary']),
            })
        return self.es.bulk(operations=operations)

The modified methods add the new embedding field to the document to be inserted. The embedding is generated from the summary field of each document. In general, embeddings are generated from sentences or short paragraphs, so in this case the summary is an ideal field to use. Other options would have been the name field, which contains the title of the document, or maybe the first few sentences from the document's body.
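If you want to inspect what insert_documents() sends to the bulk endpoint without connecting to Elasticsearch or loading the model, the list-building logic can be exercised on its own. The sketch below is a standalone approximation: build_bulk_operations() and fake_embedding() are hypothetical helpers, the two sample documents are made up, and the fake embedding is a stand-in for the model output with no semantic meaning.

```python
def build_bulk_operations(documents, get_embedding, index='my_documents'):
    """Build the alternating action/document list used by the bulk API."""
    operations = []
    for document in documents:
        # Action line: tells Elasticsearch which index receives the document.
        operations.append({'index': {'_index': index}})
        # Document line: the original fields plus the new embedding field.
        operations.append({
            **document,
            'embedding': get_embedding(document['summary']),
        })
    return operations


def fake_embedding(text):
    # Stand-in for self.model.encode(); NOT a real embedding.
    return [float(len(text)), float(text.count(' '))]


sample_documents = [
    {'name': 'First document', 'summary': 'A short summary.'},
    {'name': 'Second document', 'summary': 'Another one.'},
]

operations = build_bulk_operations(sample_documents, fake_embedding)
for operation in operations:
    print(operation)
```

Each input document produces two entries in the list, so the result is always twice as long as the list of documents.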

With these changes in place the index can be rebuilt, so that it stores an embedding for each document. To rebuild the index, use this command:

flask reindex

In case you need a reminder, the flask reindex command is implemented in the reindex() function in app.py. It calls the reindex() method of the Search class, which in turn invokes create_index() and then passes all the data from the data.json file to insert_documents().