Semantic Search as a Service at a Search Center of Excellence

For many enterprises, a search Center of Excellence (COE) offers search as a service to its users to wrangle knowledge from disparate data sources and build search capabilities into internal and external applications. Elasticsearch, the distributed search platform that “powers about 90% of all search bars on the internet”, is often the tool of choice for enterprise search COEs. With the wild popularity of ChatGPT, users are discovering the uncanny ability of LLMs to grasp meaning. This has sparked an urgent demand for enterprise search COEs to offer an enhanced search experience: one that is intuitive, natural, context-driven, and recognizes user intent effortlessly. Elasticsearch, the most downloaded vector database, supports full vector search at scale and integrates natively with transformer-based LLMs; the Elasticsearch Relevance Engine (ESRE) enables search COEs to support enterprise-grade semantic search that is secure, scalable, and performant.

In this two-part blog, we will explore how to implement and scale semantic search as a service for a search COE using Elastic Learned Sparse EncodeR (ELSER), a late interaction model that Elastic trains to deliver out-of-the-box semantic search capabilities without task-specific fine-tuning. In part 1, we examine a use case of providing semantic search on internal wiki articles for developers, the process of implementing it, and the lessons learned from it, focusing on the following areas:

Model Selection
Schema Design
Data Ingestion
Access Control
Search Techniques

Note: All code and content for this blog can be found here.

In the next part of this blog, we will survey scaling search services to developers with Elastic Search Applications.

Model Selection

The first task of implementing semantic search is to choose an embedding model that:

Converts text and its meanings into numerical vector representation
Works well for your use case

There are many options available, from commercial models offered by OpenAI, Cohere, and Anthropic to open source models, like the ones hosted on HuggingFace, including Mistral 7B and Llama 2. Also, many enterprises have their own data science teams that have fine-tuned their own LLMs using internal data.

The challenge of choosing an existing LLM for a search COE is that the model may have been trained on domain-specific datasets. Typically, a search COE must cater to a wide variety of use cases and may not have the data science resources to domain-adapt base models.

Elastic Learned Sparse EncodeR (ELSER) is a retrieval model that is designed for out-of-domain use cases, making it an easy choice for search COEs that prioritize flexibility, speed, and streamlined implementation. For this blog, we will use ELSER to tackle our use case because we are working with English-only documents and can deploy the model with a single click within Elasticsearch. For multilingual vector search, Elasticsearch offers similar one-click support for the E5 embedding model.

The image below depicts the basics of vector search. Semantic search is a subset of vector search as we are only concerned with text.

Schema Design

One "must have" for our use case is to return the most relevant passage of an article in response to a user query, along with its metadata: the URL of the original document, the dates it was published and updated, and the source of the wiki. The length of the documents varies, often exceeding the maximum input token limit (512) of the ELSER model, so the text must be "chunked" to avoid truncation. Also, we can store the article metadata along with the vector embeddings, all within Elasticsearch.

Unlike a pure vector database, Elasticsearch can perform traditional BM25 searches on the text as well as filter and aggregate the metadata natively (temporal, numerical, spatial, boolean, etc.).

What this means is that Elasticsearch can perform all these techniques as well as perform vector searches within the same datastore and within a single search request. There is no need to pre/post process text or vectors, nor correlate vector search results in the application layer or with another data store. This greatly reduces the application complexity and number of tools required.

With the use case requirements defined, we are ready to create our mappings.

    PUT dev_wiki_elser
    {
      "mappings": {
        "properties": {
          "@timestamp": {
            "type": "date"
          },
          "access": {
            "type": "keyword"
          },
          "created_at": {
            "type": "date",
            "format": "iso8601"
          },
          "key_phrases": {
            "type": "text"
          },
          "links": {
            "type": "text"
          },
          "passages": {
            "type": "nested",
            "properties": {
              "is_truncated": {
                "type": "boolean"
              },
              "model_id": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "passage_tokens": {
                "type": "sparse_vector"
              },
              "text": {
                "type": "text"
              }
            }
          },
          "source": {
            "type": "keyword"
          },
          "summary": {
            "type": "text"
          },
          "updated_at": {
            "type": "date",
            "format": "iso8601"
          }
        }
      }
    }

We are using the nested field type for chunks of the original text and their corresponding sparse vector embeddings created with ELSER. The nested field type is a special type where each inner document is indexed as its own document, and its reference is stored with the containing or parent document. With the nested field type, we are not duplicating metadata with each text chunk. Also, we will not index the original document, as its sections are available in the passages field. One additional detail is that the summary field contains the title of the article.

Data Ingestion with Elastic Ingest Pipeline

The Elasticsearch inference processor allows us to create sparse vector embeddings as we ingest the documents. In the pipeline below, a script processor first splits article_content into sentences and packs them into passages that stay under model_limit (400) characters, unless a single sentence is itself longer; in practice this keeps passages well within ELSER's 512-token input limit. An inference processor then converts each passage into a sparse vector embedding, and both the text and the embeddings are indexed into the nested field passages. The ingest pipeline is adapted from the blog “Chunking Large Documents via Ingest pipelines plus nested vectors equals easy passage search”, which demonstrates using the ingest inference processor to chunk and convert text into dense vectors with an LLM imported into Elasticsearch. Here, we are using the same technique of chunking by delimiters and calling ELSER to create sparse vector embeddings.

    PUT _ingest/pipeline/elser-v2-dev-wiki
    {
      "processors": [
        {
          "script": {
            "description": "Chunk body into sentences by looking for . followed by a space",
            "lang": "painless",
            "source": """
              String[] envSplit = /((?<!M(r|s|rs)\.)(?<=\.) |(?<=\!) |(?<=\?) )/.split(ctx['article_content']);
              ctx['passages'] = new ArrayList();
              int i = 0;
              boolean remaining = true;
              if (envSplit.length == 0) {
                return
              } else if (envSplit.length == 1) {
                Map passage = ['text': envSplit[0]];ctx['passages'].add(passage)
              } else {
                while (remaining) {
                  Map passage = ['text': envSplit[i++]];
                  while (i < envSplit.length && passage.text.length() + envSplit[i].length() < params.model_limit) {passage.text = passage.text + ' ' + envSplit[i++]}
                  if (i == envSplit.length) {remaining = false}
                  ctx['passages'].add(passage)
                }
              }
              """,
            "params": {
              "model_limit": 400
            }
          }
        },
        {
          "foreach": {
            "field": "passages",
            "processor": {
              "inference": {
                "model_id": ".elser_model_2_linux-x86_64",
                "input_output": [
                  {
                    "input_field": "_ingest._value.text",
                    "output_field": "_ingest._value.passage_tokens"
                  }
                ],
                "on_failure": [
                  {
                    "append": {
                      "field": "_source._ingest.inference_errors",
                      "value": [
                        {
                          "message": "Processor 'inference' in pipeline 'elser-v2-dev-wiki' failed with message '{{ _ingest.on_failure_message }}'",
                          "pipeline": "elser-v2-dev-wiki",
                          "timestamp": "{{{ _ingest.timestamp }}}"
                        }
                      ]
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "remove": {
            "field": [
              "article_content"
            ]
          }
        }
      ]
    }

As a part of the pipeline, we drop article_content from the index, as the content is available in chunks in the passages field.

The chunking technique used here, creating passages at sentence-ending punctuation, is a basic one. For details and code on chunking with various methods, please see Calculating tokens for Semantic Search. Also, native chunking is coming soon in Elasticsearch.
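To make the packing logic of the script processor above easier to follow and test outside Elasticsearch, here is a minimal Python sketch of the same idea. It is an approximation, not a line-for-line port: the function name chunk_text is hypothetical, the Painless lookbehind is rewritten as three fixed-width lookbehinds because Python's re module requires them, and empty fragments are dropped.

```python
import re

# Split on the space that follows '.', '!' or '?', but not after the titles
# Mr., Ms. or Mrs. (same intent as the regex in the Painless script).
SENTENCE_BOUNDARY = re.compile(
    r"(?<!Mr\.)(?<!Ms\.)(?<!Mrs\.)(?<=\.) |(?<=!) |(?<=\?) "
)


def chunk_text(text, limit=400):
    """Greedily pack whole sentences into passages shorter than `limit` characters.

    A single sentence longer than `limit` still becomes its own passage,
    as it does in the ingest pipeline script.
    """
    # Assumption: empty fragments are dropped, which the original script does not do explicitly.
    sentences = [s for s in SENTENCE_BOUNDARY.split(text) if s]
    passages = []
    i = 0
    while i < len(sentences):
        passage = sentences[i]
        i += 1
        # Keep appending sentences while the combined length stays under the limit.
        while i < len(sentences) and len(passage) + len(sentences[i]) < limit:
            passage = passage + " " + sentences[i]
            i += 1
        passages.append({"text": passage})
    return passages


if __name__ == "__main__":
    sample = "Mr. Smith configured the filter. It works! Does it scale? Yes."
    for p in chunk_text(sample, limit=45):
        print(p["text"])
```

Note that the limit is measured in characters, not model tokens, so it is a conservative stand-in for the 512-token limit rather than an exact token count.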

For this blog, we will ingest four publicly available sample documents using the pipeline. The documents are Elastic Cloud instructions on configuring traffic filtering. One of the documents has "access": "private":

    POST dev_wiki_elser/_doc?pipeline=elser-v2-dev-wiki
    {
      "summary": "IP traffic filters",
      "access": "private",
      "@timestamp": "2023-08-07T08:15:12.590363000Z",
      "key_phrases": "- Traffic filtering\n- IP address\n- CIDR block\n- Egress IP filters\n- Ingress IP filters\n- Inbound IP filters\n- Outbound IP filters\n- IPv4\n- VPC Endpoint Gateway\n- VPC Endpoint Interface",
      "updated_at": "2023-08-07T08:15:14.756374+00:00",
      "created_at": "2023-08-03T19:54:31.260012+00:00",
      "links": [
        "https://www.elastic.co/guide/en/cloud/current/ec-traffic-filtering-ip.html"
      ],
      "source": "CLOUD_INFRASTRACTURE_DOCUMENT",
      "article_content": """Traffic filtering, by IP address or CIDR block, is one of the security layers available in Elasticsearch Service. It allows you to limit how your deployments can be
    …
        Select the Delete icon. The icon is inactive if there are deployments assigned to the rule set.
        """
    }

The other three have the "access": "public" attribute. Below is the article on configuring AWS PrivateLink traffic filters.

    POST dev_wiki_elser/_doc?pipeline=elser-v2-dev-wiki
    {
      "summary": "AWS PrivateLink traffic filter",
      "access": "public",
      "@timestamp": "2023-08-07T08:15:12.590363000Z",
      "key_phrases": "- AWS\n- PrivateLink\n- VPC\n- VPC Endpoint\n- VPC Endpoint Service\n- VPC Endpoint Policy\n- VPC Endpoint Connection\n- VPC Endpoint Interface\n- VPC Endpoint Gateway\n- VPC Endpoint Interface",
      "updated_at": "2023-08-07T08:15:14.756374+00:00",
      "created_at": "2023-08-03T19:54:31.260012+00:00",
      "links": [
        "https://www.elastic.co/guide/en/cloud/current/ec-traffic-filtering-vpc.html"
      ],
      "source": "CLOUD_INFRASTRACTURE_DOCUMENT",
      "article_content": """Traffic filtering, to only AWS PrivateLink
    …
        On the Security page, under Traffic filters select Remove."""
    }

The other two documents are very similar: one covers Azure Private Link traffic filters and the other covers GCP Private Service Connect traffic filters.

Access Control

Elasticsearch natively supports role-based access control (RBAC), including field- and document-level access control, which greatly reduces the operational and maintenance complexity required to secure our application. With our developer wiki use case, we can limit access to documents based on the access field value. For example, documents marked as private should not be accessible to users with the public_documents role; to those users, documents with the attribute access: private simply do not exist. Besides limiting document access for users with the public_documents role, we can also hide the access field itself from them.

Note: the code to PUT the role via the REST API is included in the gist.
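For readers without the gist at hand, the following Python sketch builds one plausible request body for such a role. It is an assumption about the shape of the role, not the exact definition from the gist: the privileges and the excluded fields are inferred from the count and mapping responses shown in this section, and the helper name public_documents_role is hypothetical. The body would be sent with PUT _security/role/public_documents.

```python
import json


def public_documents_role(index="dev_wiki_elser"):
    """Build a role body combining document-level and field-level security."""
    return {
        "indices": [
            {
                "names": [index],
                # Assumption: read for searches and counts,
                # view_index_metadata for the get mapping call.
                "privileges": ["read", "view_index_metadata"],
                # Document-level security: only public documents are visible.
                "query": {"term": {"access": "public"}},
                # Field-level security: grant everything except the fields
                # that are absent from the mapping the restricted user sees.
                "field_security": {
                    "grant": ["*"],
                    "except": [
                        "access",
                        "passages.is_truncated",
                        "passages.model_id",
                    ],
                },
            }
        ]
    }


role_body = public_documents_role()
print("PUT _security/role/public_documents")
print(json.dumps(role_body, indent=2))
```

Keeping the role definition in code like this makes it straightforward to generate one role per audience from the same template.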

The user sherry has the role of public_documents. For this user, a simple count of our index will only return 3 documents with the attribute "access": "public".

    curl -u sherry:my_password -XGET https://my-cluster.aws.elastic-cloud.com/dev_wiki_elser/_count/?pretty

The response:

    {
      "count" : 3,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      }
    }

Also, a call to see the index mappings will only return the fields the user is authorized to see.

    curl -u sherry:my_password -XGET https://my-cluster.aws.elastic-cloud.com/dev_wiki_elser/_mapping/?pretty
    {
      "dev_wiki_elser" : {
        "mappings" : {
          "properties" : {
            "@timestamp" : {
              "type" : "date"
            },
            "created_at" : {
              "type" : "date",
              "format" : "iso8601"
            },
            "key_phrases" : {
              "type" : "text"
            },
            "links" : {
              "type" : "text"
            },
            "passages" : {
              "type" : "nested",
              "properties" : {
                "passage_tokens" : {
                  "type" : "sparse_vector"
                },
                "text" : {
                  "type" : "text"
                }
              }
            },
            "source" : {
              "type" : "keyword"
            },
            "summary" : {
              "type" : "text"
            },
            "updated_at" : {
              "type" : "date",
              "format" : "iso8601"
            }
          }
        }
      }
    }

With this RBAC in place, users with the role public_documents will only be able to search and access the documents and fields that they have privileges for. This is a simple yet powerful example of how easy it is to secure data and content natively within Elasticsearch. In addition, Elasticsearch fully supports other enterprise authentication and authorization frameworks such as SAML, LDAP, and Active Directory, so leveraging those frameworks to control access to the data is just a matter of simple configuration.

Search Techniques

Now, we are ready to move on to semantic search. Please note that search relevancy tuning is an iterative process that may require us to go back and use a different model than the one we have adopted and/or modify the index schema.

In addition, adding metadata filtering to our semantic search queries can improve relevance. Combining search algorithms is another way to optimize results: BM25 with ELSER, hybrid search using reciprocal rank fusion (RRF), dense vectors along with ELSER, or some combination of them. Elasticsearch supports all of them natively.
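As a sketch of the metadata-filtering idea, the Python helper below builds a search body that runs the ELSER text expansion query against the nested passages field and optionally restricts results to a single wiki source. The helper name filtered_elser_query and the choice of the source field as the filter are assumptions made for illustration; the field names and model ID come from the mapping and pipeline defined earlier.

```python
def filtered_elser_query(question, source=None,
                         model_id=".elser_model_2_linux-x86_64"):
    """Build a search body: ELSER on nested passages, optional term filter on source."""
    nested = {
        "nested": {
            "path": "passages",
            "score_mode": "max",
            "query": {
                "text_expansion": {
                    "passages.passage_tokens": {
                        "model_id": model_id,
                        "model_text": question,
                    }
                }
            },
            # Return only the best-matching passage of each document.
            "inner_hits": {"size": 1, "_source": False, "fields": ["passages.text"]},
        }
    }
    bool_query = {"must": [nested]}
    if source is not None:
        # Filters restrict the candidate set without contributing to the score.
        bool_query["filter"] = [{"term": {"source": source}}]
    return {"query": {"bool": bool_query}}


body = filtered_elser_query(
    "How to configure traffic filter rule set private link google",
    source="CLOUD_INFRASTRACTURE_DOCUMENT",
)
print(body)
```

Because the filter clause does not affect scoring, it narrows the documents considered while leaving the ELSER ranking of the remaining documents unchanged.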

The corpus of our use case, developer wiki documents, tends to contain very similar topics and terms. The following query retrieves the most relevant passage on configuring the filter rule set to connect Google Private Service Connect to Elastic Cloud deployments using only the ELSER vector embeddings. We have set "_source": "false" as we only need to output the summary or title, URL, and the relevant section of the article. We skip the dates for clarity.

    GET dev_wiki_elser/_search
    {
      "_source": "false",
      "fields": [
        "summary",
        "links"
      ],
      "query": {
        "nested": {
          "path": "passages",
          "score_mode": "max",
          "query": {
            "text_expansion": {
              "passages.passage_tokens": {
                "model_id": ".elser_model_2_linux-x86_64",
                "model_text": "How to configure traffic filter rule set private link google"
              }
            }
          },
          "inner_hits": {
            "size": 1,
            "_source": false,
            "fields": [
              "passages.text"
            ]
          }
        }
      }
    }

Unfortunately, the first article returned is about configuring traffic filter sets for AWS instead of the article for Google Cloud. This is not surprising: “private link” is common vocabulary for a secure link between VPCs. The term was first used by Amazon Web Services (AWS) and, like Kleenex for tissue, is now widely used generically, even though the equivalent in Google Cloud Platform is called Private Service Connect.

    {
      "took": 19,
      "timed_out": false,
      "_shards": {
       …
      },
      "hits": {
        "total": {
          "value": 4,
          "relation": "eq"
        },
        "max_score": 20.017376,
        "hits": [
          {
            "_index": "dev_wiki_elser",
            "_id": "UGxiOY0Bds674Ci9z6yW",
            "_score": 20.017376,
            "_source": {},
            "fields": {
              "summary": [
                "AWS PrivateLink traffic filter"
              ],
              "links": ["https://www.elastic.co/guide/en/cloud/current/ec-traffic-filtering-vpc.html"]
            },
            "inner_hits": {
              "passages": {
                "hits": {
                  "total": {
                    "value": 24,
                    "relation": "eq"
                  },
                  "max_score": 20.017376,
                  "hits": [
                    {
                      "_index": "dev_wiki_elser",
                      "_id": "UGxiOY0Bds674Ci9z6yW",
                      "_nested": {
                        "field": "passages",
                        "offset": 17
                      },
                      "_score": 20.017376,
                      "fields": {
                        "passages": [
                          {
                            "text": [
                              """Or, select Dedicated deployments to go to the deployments page to view all of your deployments.
        Under the Features tab, open the Traffic filters page.
        Select Create filter.
        Select Private link endpoint.
    ...

That is where the key_phrases field in our mappings comes in. In our use case, the articles have keywords and phrases attached, which provides an excellent way to augment the semantic search with a traditional BM25 search.

Often, content does not have keywords or phrases. However, that can easily be solved by using an LLM to perform keyword extraction that distills the essence of the articles. This can be achieved using a public LLM, such as those from OpenAI or Gemini, or a locally hosted LLM such as Mistral 7B or Llama 2. Keep an eye out for that in the next blog.

Wrapping our original text expansion search and a match query on the key_phrases field within a bool query ensures that the right documents are returned as expected.

    GET dev_wiki_elser/_search
    {
      "_source": "false",
      "fields": [
        "summary",
        "links"
      ],
      "query": {
        "bool": {
          "should": [
            {
              "nested": {
                "path": "passages",
                "score_mode": "max",
                "query": {
                  "text_expansion": {
                    "passages.passage_tokens": {
                      "model_id": ".elser_model_2_linux-x86_64",
                      "model_text": "How to configure traffic filter rule set private link google"
                    }
                  }
                },
                "inner_hits": {
                  "size": 1,
                  "_source": false,
                  "fields": [
                    "passages.text"
                  ]
                }
              }
            },
            {
              "match": {
                "key_phrases": "How to configure traffic filter rule set private link google"
              }
            }
          ]
        }
      }
    }

The response from the query correctly identifies the article in question.

    {
      "took": 22,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 4,
          "relation": "eq"
        },
        "max_score": 22.327188,
        "hits": [
          {
            "_index": "dev_wiki_elser",
            "_id": "UmxjOY0Bds674Ci9Dqzy",
            "_score": 22.327188,
            "_source": {},
            "fields": {
              "summary": [
                "Google Private Service Connect"
              ],
              "links": [
                "https://www.elastic.co/guide/en/cloud/current/ec-traffic-filtering-psc.html"
              ]
            },
            "inner_hits": {
              "passages": {
                "hits": {
                  "total": {
                    "value": 22,
                    "relation": "eq"
                  },
                  "max_score": 19.884687,
                  "hits": [
                    {
                      "_index": "dev_wiki_elser",
                      "_id": "UmxjOY0Bds674Ci9Dqzy",
                      "_nested": {
                        "field": "passages",
                        "offset": 16
                      },
                      "_score": 19.884687,
                      "fields": {
                        "passages": [
                          {
                            "text": [
                              """Add the Private Service Connect rules to your deployments
                              …

The other technique we experimented with for the use case is hybrid search with reciprocal rank fusion. For our use case, the approach is not as effective as the simple boolean query with text expansion and BM25. Additionally, the response passages often start in the middle of a paragraph due to the chunking technique that we are using, even though the relevance is good. We can prettify the response in the application layer.
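For readers unfamiliar with reciprocal rank fusion, the Python sketch below shows the general idea: each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. This is a standalone illustration with made-up document IDs, not the query we ran and not a reproduction of our results; the function name and the constant k = 60 (a commonly used value) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered from most to least relevant.
    Returns (doc_id, score) pairs sorted by descending fused score.
    """
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from two retrievers (for example, ELSER and BM25).
semantic_ranking = ["doc_a", "doc_b", "doc_c"]
lexical_ranking = ["doc_b", "doc_c", "doc_a"]

for doc_id, score in reciprocal_rank_fusion([semantic_ranking, lexical_ranking]):
    print(doc_id, round(score, 6))
```

Because RRF uses only ranks and ignores the raw scores, a document that scores very strongly in one retriever gets no extra credit for that margin, which is one reason a weighted boolean combination can behave differently on a small, homogeneous corpus.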

Summary

In this blog, we explored the process of implementing semantic search at an enterprise COE.

Model Selection: ELSER is a retrieval model that is designed for out-of-domain use cases, making it an easy choice for search COEs that prioritize flexibility, speed, and streamlined implementation.
Schema Design: Unlike pure vector databases, Elasticsearch can perform traditional BM25 searches, filter and aggregate metadata natively, and run vector searches, all within the same datastore and the same query. This significantly reduces application complexity and the number of tools required compared with a pure vector database.

Using the nested field type for chunks of the original text and their corresponding sparse vector embeddings created with ELSER is an efficient way to store and search the data.
Data Ingestion: An Elasticsearch ingest pipeline, with its script and inference processors, allows us to "chunk" the text and create sparse vector embeddings as we ingest the documents. This can all be accomplished within Elasticsearch, eliminating the need for additional tools and preprocessing. Native chunking is coming soon in Elasticsearch.
Access Control: Elasticsearch has strong native RBAC controls to limit access, including field- and document-level access control, which greatly reduces the operational and maintenance complexity required to secure the data and application.
Search Techniques: Semantic search is an iterative process that may require tuning or combining more than one search technique to produce the most relevant results. In this case, we found that just a pure semantic text expansion search did not provide the best relevance. So, we combined the semantic search and traditional BM25 search on the key phrases to filter the documents and provide the best results. There are additional techniques such as reciprocal rank fusion which could be used, although they did not provide better results in this use case.

As a reminder, all code and content for this blog can be found here.

Additional Considerations

One thing we have omitted is how to evaluate search relevance in a scalable and systematic way. For our use case, we have a test dataset, queries, and expected results to measure quantitatively “what is good”. We highly recommend having a look at Improving search relevance with data-driven query optimization for a robust approach.
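As a minimal illustration of measuring relevance against a set of test queries and expected results, the Python sketch below computes mean reciprocal rank (MRR). The metric is chosen here only as an example and is not necessarily the one we used; the function name and the sample data are hypothetical.

```python
def mean_reciprocal_rank(results, expected):
    """Average of 1 / rank of the first relevant hit per query (0 if none is found).

    results:  dict mapping query -> ranked list of returned document IDs
    expected: dict mapping query -> set of relevant document IDs
    """
    if not results:
        return 0.0
    total = 0.0
    for query, ranked_ids in results.items():
        relevant = expected.get(query, set())
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)


# Hypothetical search output and judgments for three test queries.
results = {
    "q1": ["doc_1", "doc_2"],
    "q2": ["doc_3", "doc_4"],
    "q3": ["doc_5"],
}
expected = {"q1": {"doc_1"}, "q2": {"doc_4"}, "q3": {"doc_9"}}

print(mean_reciprocal_rank(results, expected))
```

Running the same test set after each change to the query or schema gives a single number to compare, which makes the iterative tuning described in this blog easier to track.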

Also, tuning semantic search relevance for our use case is an iterative process. For many search COEs, it makes sense to start fast, adjust course, and adopt different techniques when required. LLMs are an astonishingly pliable tool in our toolset to provide the next-level search experience to our users.

Next Blog

In the next part of this blog, we will look at how to expose our Elasticsearch queries to developers in an optimized and simple way to enable them to incorporate search functionality into applications quickly. As part of that, we will provide an example of keyword extraction, which can improve search performance.
