Semantic search with semantic_text
This tutorial shows you how to use the semantic text feature to perform semantic search on your data.
Semantic text simplifies the inference workflow by performing inference at ingest time and applying sensible default values automatically. You don't need to define model-related settings and parameters, or create inference ingest pipelines.
The recommended way to use semantic search in the Elastic Stack is to follow the semantic_text workflow. When you need more control over indexing and query settings, you can still use the complete inference workflow (refer to this tutorial to review the process).
This tutorial uses the elasticsearch service for demonstration; the required inference endpoint is created automatically as needed. You can use any service and its supported models offered by the Inference API. To use the semantic_text field type with an inference service other than the elasticsearch service, you must first create an inference endpoint using the Create inference API.
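For example, you could create your own ELSER endpoint on ML nodes with the Create inference API. The endpoint name my-elser-endpoint below is an illustrative placeholder, not a value used elsewhere in this tutorial:

PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true
    },
    "num_threads": 1,
    "model_id": ".elser_model_2"
  }
}

You can then reference my-elser-endpoint in the inference_id parameter of a semantic_text field mapping.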
You must create the mapping of the destination index: the index that will contain the embeddings that the inference endpoint generates from your input text. The destination index must have a field with the semantic_text field type to index the output of the inference endpoint.
You can run inference either on the Elastic Inference Service or on your own ML nodes. The following examples show both scenarios.
PUT semantic-embeddings
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text"
}
}
}
}
- The name of the field to contain the generated embeddings.
- The field to contain the embeddings is a semantic_text field. Since no inference_id is provided, the default endpoint .elser-2-elastic for the elasticsearch service is used. This inference endpoint uses the Elastic Inference Service (EIS).
PUT semantic-embeddings
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": ".elser-2-elastic"
}
}
}
}
- The name of the field to contain the generated embeddings.
- The field to contain the embeddings is a semantic_text field.
- The .elser-2-elastic preconfigured inference endpoint for the elasticsearch service is used. This inference endpoint uses the Elastic Inference Service (EIS).
PUT semantic-embeddings
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": ".elser-2-elasticsearch"
}
}
}
}
- The name of the field to contain the generated embeddings.
- The field to contain the embeddings is a semantic_text field.
- The .elser-2-elasticsearch preconfigured inference endpoint for the elasticsearch service is used. To use a different inference service, you must first create an inference endpoint using the Create inference API, and then specify it in the semantic_text field mapping using the inference_id parameter.
To try the ELSER model on the Elastic Inference Service, explicitly set the inference_id to .elser-2-elastic. For instructions, refer to Using semantic_text with ELSER on EIS.
When using semantic_text with dense vector embeddings (such as E5 or other text embedding models), you can optimize storage and search performance by configuring index_options on the underlying dense_vector field. This is particularly useful for large-scale deployments. The index_options parameter is only applicable when using inference endpoints that produce dense vector embeddings (like E5, OpenAI embeddings, Cohere embeddings, and others). It does not apply to sparse vector models like ELSER, which use a different internal representation.
The index_options parameter controls how vectors are indexed and stored. For dense vector embeddings, you can specify quantization strategies like Better Binary Quantization (BBQ) that significantly reduce memory footprint while maintaining search quality. Quantization compresses high-dimensional vectors into more efficient representations, enabling faster searches and reduced memory consumption. For details on available options and their trade-offs, refer to the dense_vector index_options documentation.
For most production use cases using semantic_text with dense vector embeddings from text models (like E5, OpenAI, or Cohere), BBQ is recommended as it provides up to 32x memory reduction with minimal accuracy loss. BBQ requires a minimum of 64 dimensions and works best with text embeddings (it might not perform well with other types like image embeddings). Choose from:
- bbq_hnsw - Best for most use cases (default for 384+ dimensions)
- bbq_flat - BBQ without HNSW for smaller datasets
- bbq_disk - Disk-based storage for large datasets with minimal memory requirements
Here's an example using semantic_text with a text embedding inference endpoint and BBQ quantization:
PUT semantic-embeddings-optimized
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": ".multilingual-e5-small-elasticsearch",
"index_options": {
"dense_vector": {
"type": "bbq_hnsw"
}
}
}
}
}
}
- Reference to a text embedding inference endpoint. This example uses the built-in E5 endpoint that is automatically available. For custom models, you must create the endpoint first using the Create inference API.
- Use Better Binary Quantization with HNSW indexing for optimal memory efficiency. This setting applies to the underlying dense_vector field that stores the embeddings.
You can also use bbq_flat for smaller datasets where you need maximum accuracy at the expense of speed:
PUT semantic-embeddings-flat
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": ".multilingual-e5-small-elasticsearch",
"index_options": {
"dense_vector": {
"type": "bbq_flat"
}
}
}
}
}
}
- Use BBQ without HNSW for smaller datasets. This uses brute-force search and requires less compute resources during indexing but more during querying.
For large datasets where RAM is constrained, use bbq_disk (DiskBBQ) to minimize memory usage:
PUT semantic-embeddings-disk
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": ".multilingual-e5-small-elasticsearch",
"index_options": {
"dense_vector": {
"type": "bbq_disk"
}
}
}
}
}
}
- Use DiskBBQ when RAM is limited. Available in Elasticsearch 9.2+, this option keeps vectors in compressed form on disk and only loads/decompresses small portions on-demand during queries. Unlike standard HNSW indexes (which rely on filesystem cache to load vectors into memory for fast search), DiskBBQ dramatically reduces RAM requirements by avoiding the need to cache vectors in memory. This enables vector search on much larger datasets with minimal memory, though queries will be slower compared to in-memory approaches.
Other quantization options include int8_hnsw (8-bit integer quantization) and int4_hnsw (4-bit integer quantization):
PUT semantic-embeddings-int8
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": ".multilingual-e5-small-elasticsearch",
"index_options": {
"dense_vector": {
"type": "int8_hnsw"
}
}
}
}
}
}
- Use 8-bit integer quantization for 4x memory reduction with high accuracy retention. For 4-bit quantization, use "type": "int4_hnsw" instead, which provides up to 8x memory reduction. For the full list of other available quantization options (including int4_flat and others), refer to the dense_vector index_options documentation.
For HNSW-specific tuning parameters like m and ef_construction, you can include them in the index_options:
PUT semantic-embeddings-custom
{
"mappings": {
"properties": {
"content": {
"type": "semantic_text",
"inference_id": ".multilingual-e5-small-elasticsearch",
"index_options": {
"dense_vector": {
"type": "bbq_hnsw",
"m": 32,
"ef_construction": 200
}
}
}
}
}
}
- The number of neighbors each node will be connected to in the HNSW graph. Higher values improve recall but increase memory usage. Default is 16.
- Number of candidates considered during graph construction. Higher values improve index quality but slow down indexing. Default is 100.
If you're using web crawlers or connectors to generate indices, you have to update the index mappings for these indices to include the semantic_text field. Once the mapping is updated, you'll need to run a full web crawl or a full connector sync. This ensures that all existing documents are reprocessed and updated with the new semantic embeddings, enabling semantic search on the updated data.
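For example, an existing crawler or connector index can be updated with the update mapping API; the index name my-crawler-index below is a placeholder for your own index:

PUT my-crawler-index/_mapping
{
  "properties": {
    "content": {
      "type": "semantic_text"
    }
  }
}

After the mapping is updated, run a full crawl or full sync so that existing documents are reprocessed and embeddings are generated for the new field.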
In this step, you load the data that you will later use to create embeddings from.
Use the msmarco-passagetest2019-top1000 data set, which is a subset of the MS MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that data set and compiled into a tsv file.
Download the file and upload it to your cluster using the Data Visualizer in the Machine Learning UI. After your data is analyzed, click Override settings. Under Edit field names, assign id to the first column and content to the second. Click Apply, then Import. Name the index test-data, and click Import. After the upload is complete, you will see an index named test-data with 182,469 documents.
Create the embeddings from the text by reindexing the data from the test-data index to the semantic-embeddings index. The data in the content field will be reindexed into the content semantic text field of the destination index. The reindexed data will be processed by the inference endpoint associated with the content semantic text field.
This step uses the reindex API to simulate data ingestion. If you are working with data that has already been indexed, rather than using the test-data set, reindexing is required to ensure that the data is processed by the inference endpoint and the necessary embeddings are generated.
POST _reindex?wait_for_completion=false
{
"source": {
"index": "test-data",
"size": 10
},
"dest": {
"index": "semantic-embeddings"
}
}
- The default batch size for reindexing is 1000. Reducing size to a smaller number makes updates to the reindexing progress more frequent, which enables you to follow the process closely and detect errors early.
The call returns a task ID to monitor the progress:
GET _tasks/<task_id>
Reindexing large datasets can take a long time. You can test this workflow using only a subset of the dataset. Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed. The following API request will cancel the reindexing task:
POST _tasks/<task_id>/_cancel
After the data has been indexed with the embeddings, you can query the data using semantic search. Choose between Query DSL or ES|QL syntax to execute the query.
The Query DSL approach uses the match query type with the semantic_text field:
GET semantic-embeddings/_search
{
"query": {
"match": {
"content": {
"query": "What causes muscle soreness after running?"
}
}
}
}
- The semantic_text field on which you want to perform the search.
- The query text.
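As a sketch of an alternative, semantic_text fields can also be queried with the dedicated semantic query type, shown here against the same index and field as above:

GET semantic-embeddings/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "What causes muscle soreness after running?"
    }
  }
}

The match query shown above is generally preferred because it also works transparently across other field types, but the semantic query makes the intent explicit.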
The ES|QL approach uses the match (:) operator, which automatically detects the semantic_text field and performs the search on it. The query uses METADATA _score to sort by _score in descending order.
POST /_query?format=txt
{
"query": """
FROM semantic-embeddings METADATA _score
| WHERE content: "How to avoid muscle soreness while running?"
| SORT _score DESC
| LIMIT 1000
"""
}
- The METADATA _score clause is used to return the score of each document.
- The match (:) operator runs the query against the content field; because content is a semantic_text field, this performs a semantic search.
- Sorts by descending score to display the most relevant results first.
- Limits the results to 1000 documents.
- For an overview of all query types supported by semantic_text fields and guidance on when to use them, see Querying semantic_text fields.
- If you want to use semantic_text in hybrid search, refer to this notebook for a step-by-step guide.
- For more information on how to optimize your ELSER endpoints, refer to the ELSER recommendations section in the model documentation.
- To learn more about model autoscaling, refer to the trained model autoscaling page.