Elasticsearch inference integration

Creates an inference endpoint to perform an inference task with the elasticsearch service.
Your Elasticsearch deployment contains preconfigured ELSER and E5 inference endpoints; you only need to create endpoints with the API if you want to customize the settings.

If you use the ELSER or the E5 model through the elasticsearch service, the API request automatically downloads and deploys the model if it isn't downloaded yet.
Request

PUT /_inference/<task_type>/<inference_id>
Path parameters

- <inference_id>
  (Required, string) The unique identifier of the inference endpoint.
- <task_type>
  (Required, string) The type of the inference task that the model will perform.
  Available task types:
  - rerank
  - sparse_embedding
  - text_embedding
Request body

- chunking_settings
  (Optional, object) Chunking configuration object. Refer to Configuring chunking to learn more about chunking. A sketch of a request that overrides the chunking defaults appears after this list.
  - max_chunk_size
    (Optional, integer) Specifies the maximum size of a chunk in words. Defaults to 250. This value cannot be higher than 300 or lower than 20 (for the sentence strategy) or 10 (for the word strategy).
  - overlap
    (Optional, integer) Only for the word chunking strategy. Specifies the number of overlapping words for chunks. Defaults to 100. This value cannot be higher than half of max_chunk_size.
  - sentence_overlap
    (Optional, integer) Only for the sentence chunking strategy. Specifies the number of overlapping sentences for chunks. It can be either 1 or 0. Defaults to 1.
  - strategy
    (Optional, string) Specifies the chunking strategy. It can be either sentence or word.
- service
  (Required, string) The type of service supported for the specified task type. In this case, elasticsearch.
- service_settings
  (Required, object) Settings used to install the inference model. These settings are specific to the elasticsearch service.
  - deployment_id
    (Optional, string) The deployment_id of an existing trained model deployment. When deployment_id is used, the model_id is optional.
  - adaptive_allocations
    (Optional, object) Adaptive allocations configuration object. If enabled, the number of allocations of the model is set based on the current load the process gets. When the load is high, a new model allocation is automatically created (respecting the value of max_number_of_allocations if it's set). When the load is low, a model allocation is automatically removed (respecting the value of min_number_of_allocations if it's set). If adaptive_allocations is enabled, do not set the number of allocations manually.
    - enabled
      (Optional, Boolean) If true, adaptive_allocations is enabled. Defaults to false.
    - max_number_of_allocations
      (Optional, integer) Specifies the maximum number of allocations to scale to. If set, it must be greater than or equal to min_number_of_allocations.
    - min_number_of_allocations
      (Optional, integer) Specifies the minimum number of allocations to scale to. If set, it must be greater than or equal to 0. If not defined, the deployment scales to 0.
  - model_id
    (Required, string) The name of the model to use for the inference task. It can be the ID of either a built-in model (for example, .multilingual-e5-small for E5) or a text embedding model already uploaded through Eland.
  - num_allocations
    (Required, integer) The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput. If adaptive_allocations is enabled, do not set this value, because it's automatically set.
  - num_threads
    (Required, integer) Sets the number of threads used by each model allocation during inference. This generally increases the speed per inference request. The inference process is a compute-bound process; threads_per_allocations must not exceed the number of available allocated processors per node. Must be a power of 2. Max allowed value is 32.
- task_settings
  (Optional, object) Settings to configure the inference task. These settings are specific to the <task_type> you specified.

  task_settings for the rerank task type:
  - return_documents
    (Optional, Boolean) Returns the document instead of only the index. Defaults to true. An example appears after the Elastic Rerank example below.
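For example, here is a minimal sketch of a request that overrides the default chunking behavior with the word strategy, using the Python client as in the examples below. The endpoint name my-chunked-endpoint and the chunking values are illustrative, not recommendations:

resp = client.inference.put(
    task_type="sparse_embedding",
    inference_id="my-chunked-endpoint",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": ".elser_model_2"
        },
        # word strategy: chunks of at most 120 words;
        # overlap may be at most half of max_chunk_size
        "chunking_settings": {
            "strategy": "word",
            "max_chunk_size": 120,
            "overlap": 60
        }
    },
)
print(resp)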
ELSER via the elasticsearch service

The following example shows how to create an inference endpoint called my-elser-model to perform a sparse_embedding task type.
The API request below will automatically download the ELSER model if it isn't already downloaded and then deploy the model.
resp = client.inference.put(
task_type="sparse_embedding",
inference_id="my-elser-model",
inference_config={
"service": "elasticsearch",
"service_settings": {
"adaptive_allocations": {
"enabled": True,
"min_number_of_allocations": 1,
"max_number_of_allocations": 4
},
"num_threads": 1,
"model_id": ".elser_model_2"
}
},
)
print(resp)
const response = await client.inference.put({
task_type: "sparse_embedding",
inference_id: "my-elser-model",
inference_config: {
service: "elasticsearch",
service_settings: {
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 1,
max_number_of_allocations: 4,
},
num_threads: 1,
model_id: ".elser_model_2",
},
},
});
console.log(response);
PUT _inference/sparse_embedding/my-elser-model
{
"service": "elasticsearch",
"service_settings": {
"adaptive_allocations": {
"enabled": true,
"min_number_of_allocations": 1,
"max_number_of_allocations": 4
},
"num_threads": 1,
"model_id": ".elser_model_2"
}
}
Adaptive allocations will be enabled with a minimum of 1 and a maximum of 4 allocations.
The model_id must be the ID of one of the built-in ELSER models. Valid values are .elser_model_2 and .elser_model_2_linux-x86_64.
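Once the endpoint is created, you can verify it by running inference against it directly. A minimal sketch with the Python client (the input text is illustrative):

resp = client.inference.inference(
    inference_id="my-elser-model",
    input=["What is Elastic?"],
)
print(resp)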
Elastic Rerank via the elasticsearch service

The following example shows how to create an inference endpoint called my-elastic-rerank to perform a rerank task type using the built-in Elastic Rerank cross-encoder model.
The API request below will automatically download the Elastic Rerank model if it isn't already downloaded and then deploy the model.
Once deployed, the model can be used for semantic re-ranking with a text_similarity_reranker retriever, as shown in the query sketch after the example below.
resp = client.inference.put(
task_type="rerank",
inference_id="my-elastic-rerank",
inference_config={
"service": "elasticsearch",
"service_settings": {
"model_id": ".rerank-v1",
"num_threads": 1,
"adaptive_allocations": {
"enabled": True,
"min_number_of_allocations": 1,
"max_number_of_allocations": 4
}
}
},
)
print(resp)
const response = await client.inference.put({
task_type: "rerank",
inference_id: "my-elastic-rerank",
inference_config: {
service: "elasticsearch",
service_settings: {
model_id: ".rerank-v1",
num_threads: 1,
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 1,
max_number_of_allocations: 4,
},
},
},
});
console.log(response);
PUT _inference/rerank/my-elastic-rerank
{
"service": "elasticsearch",
"service_settings": {
"model_id": ".rerank-v1",
"num_threads": 1,
"adaptive_allocations": {
"enabled": true,
"min_number_of_allocations": 1,
"max_number_of_allocations": 4
}
}
}
The model_id must be the ID of the built-in Elastic Rerank model, .rerank-v1.
Adaptive allocations will be enabled with a minimum of 1 and a maximum of 4 allocations.
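The rerank task type also accepts the task_settings described earlier. For example, a minimal sketch (the endpoint name my-elastic-rerank-ids is hypothetical) that creates an endpoint returning only the index of each ranked document rather than the document itself:

resp = client.inference.put(
    task_type="rerank",
    inference_id="my-elastic-rerank-ids",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "model_id": ".rerank-v1",
            "num_allocations": 1,
            "num_threads": 1
        },
        # return only indices, not the documents themselves
        "task_settings": {
            "return_documents": False
        }
    },
)
print(resp)

Once deployed, the endpoint can drive semantic re-ranking through a text_similarity_reranker retriever. The sketch below assumes a hypothetical index my-index with a text field named text:

resp = client.search(
    index="my-index",
    retriever={
        "text_similarity_reranker": {
            # first-pass retriever whose results will be re-ranked
            "retriever": {
                "standard": {
                    "query": {"match": {"text": "how do adaptive allocations work"}}
                }
            },
            "field": "text",
            "inference_id": "my-elastic-rerank",
            "inference_text": "how do adaptive allocations work"
        }
    },
)
print(resp)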
E5 via the elasticsearch service

The following example shows how to create an inference endpoint called my-e5-model to perform a text_embedding task type.
The API request below will automatically download the E5 model if it isn't already downloaded and then deploy the model.
resp = client.inference.put(
task_type="text_embedding",
inference_id="my-e5-model",
inference_config={
"service": "elasticsearch",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": ".multilingual-e5-small"
}
},
)
print(resp)
const response = await client.inference.put({
task_type: "text_embedding",
inference_id: "my-e5-model",
inference_config: {
service: "elasticsearch",
service_settings: {
num_allocations: 1,
num_threads: 1,
model_id: ".multilingual-e5-small",
},
},
});
console.log(response);
PUT _inference/text_embedding/my-e5-model
{
"service": "elasticsearch",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": ".multilingual-e5-small"
}
}
The model_id must be the ID of one of the built-in E5 models. Valid values are .multilingual-e5-small and .multilingual-e5-small_linux-x86_64.
You might see a 502 bad gateway error in the response when using the Kibana Console.
This error usually just reflects a timeout while the model downloads in the background.
You can check the download progress in the Machine Learning UI.
If you are using the Python client, you can set the timeout parameter to a higher value, as shown below.
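For example, a minimal sketch that raises the per-request timeout with the Python client (the 10-minute value is illustrative):

# Raise the per-request timeout so a slow model download does not
# surface as a client-side timeout.
resp = client.options(request_timeout=600).inference.put(
    task_type="text_embedding",
    inference_id="my-e5-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": ".multilingual-e5-small"
        }
    },
)
print(resp)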
Models uploaded by Eland via the elasticsearch service

The following example shows how to create an inference endpoint called my-msmarco-minilm-model to perform a text_embedding task type.
resp = client.inference.put(
task_type="text_embedding",
inference_id="my-msmarco-minilm-model",
inference_config={
"service": "elasticsearch",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": "msmarco-MiniLM-L12-cos-v5"
}
},
)
print(resp)
const response = await client.inference.put({
task_type: "text_embedding",
inference_id: "my-msmarco-minilm-model",
inference_config: {
service: "elasticsearch",
service_settings: {
num_allocations: 1,
num_threads: 1,
model_id: "msmarco-MiniLM-L12-cos-v5",
},
},
});
console.log(response);
PUT _inference/text_embedding/my-msmarco-minilm-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": "msmarco-MiniLM-L12-cos-v5"
  }
}
Provide a unique identifier for the inference endpoint. The inference_id must be unique and must not match the model_id.
The model_id must be the ID of a text embedding model which has already been uploaded through Eland.
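For reference, a model like the one above can be uploaded with the eland_import_hub_model CLI. This is a minimal sketch, assuming Eland is installed; the exact flags may differ between Eland versions, and the URL and API key are placeholders:

eland_import_hub_model \
  --url "https://localhost:9200" \
  --es-api-key "<api-key>" \
  --hub-model-id "sentence-transformers/msmarco-MiniLM-L12-cos-v5" \
  --es-model-id "msmarco-MiniLM-L12-cos-v5" \
  --task-type text_embedding \
  --start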
Setting adaptive allocations for E5 via the elasticsearch service

The following example shows how to create an inference endpoint called my-e5-model to perform a text_embedding task type and configure adaptive allocations.
The API request below will automatically download the E5 model if it isn’t already downloaded and then deploy the model.
resp = client.inference.put(
task_type="text_embedding",
inference_id="my-e5-model",
inference_config={
"service": "elasticsearch",
"service_settings": {
"adaptive_allocations": {
"enabled": True,
"min_number_of_allocations": 3,
"max_number_of_allocations": 10
},
"num_threads": 1,
"model_id": ".multilingual-e5-small"
}
},
)
print(resp)
const response = await client.inference.put({
task_type: "text_embedding",
inference_id: "my-e5-model",
inference_config: {
service: "elasticsearch",
service_settings: {
adaptive_allocations: {
enabled: true,
min_number_of_allocations: 3,
max_number_of_allocations: 10,
},
num_threads: 1,
model_id: ".multilingual-e5-small",
},
},
});
console.log(response);
PUT _inference/text_embedding/my-e5-model
{
"service": "elasticsearch",
"service_settings": {
"adaptive_allocations": {
"enabled": true,
"min_number_of_allocations": 3,
"max_number_of_allocations": 10
},
"num_threads": 1,
"model_id": ".multilingual-e5-small"
}
}
Using an existing model deployment with the elasticsearch service

The following example shows how to use an already existing model deployment when creating an inference endpoint.
resp = client.inference.put(
task_type="sparse_embedding",
inference_id="use_existing_deployment",
inference_config={
"service": "elasticsearch",
"service_settings": {
"deployment_id": ".elser_model_2"
}
},
)
print(resp)
const response = await client.inference.put({
task_type: "sparse_embedding",
inference_id: "use_existing_deployment",
inference_config: {
service: "elasticsearch",
service_settings: {
deployment_id: ".elser_model_2",
},
},
});
console.log(response);
PUT _inference/sparse_embedding/use_existing_deployment
{
"service": "elasticsearch",
"service_settings": {
"deployment_id": ".elser_model_2"
}
}
The API response contains the model_id and the thread and allocation settings from the model deployment:
{
"inference_id": "use_existing_deployment",
"task_type": "sparse_embedding",
"service": "elasticsearch",
"service_settings": {
"num_allocations": 2,
"num_threads": 1,
"model_id": ".elser_model_2",
"deployment_id": ".elser_model_2"
},
"chunking_settings": {
"strategy": "sentence",
"max_chunk_size": 250,
"sentence_overlap": 1
}
}
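You can retrieve the same endpoint configuration at any later point with the GET inference API. A minimal sketch with the Python client:

resp = client.inference.get(inference_id="use_existing_deployment")
print(resp)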