Create an inference endpoint | Elasticsearch API documentation (v8)

Create an inference endpoint

Generally available; added in 8.11.0

PUT /_inference/{task_type}/{inference_id}

All methods and paths for this operation:

PUT /_inference/{inference_id}

PUT /_inference/{task_type}/{inference_id}

IMPORTANT: The inference APIs enable you to use certain services, such as built-in machine learning models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Mistral, Azure OpenAI, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face. For built-in models and models uploaded through Eland, the inference APIs offer an alternative way to use and manage trained models. However, if you do not plan to use these models through the inference APIs, or if you want to use non-NLP models, use the machine learning trained model APIs instead.

The following integrations are available through the inference API. You can find the available task types next to the integration name:

AlibabaCloud AI Search (completion, rerank, sparse_embedding, text_embedding)
Amazon Bedrock (completion, text_embedding)
Amazon SageMaker (chat_completion, completion, rerank, sparse_embedding, text_embedding)
Anthropic (completion)
Azure AI Studio (completion, text_embedding)
Azure OpenAI (completion, text_embedding)
Cohere (completion, rerank, text_embedding)
DeepSeek (chat_completion, completion)
Elasticsearch (rerank, sparse_embedding, text_embedding - this service is for built-in models and models uploaded through Eland)
ELSER (sparse_embedding)
Google AI Studio (completion, text_embedding)
Google Vertex AI (chat_completion, completion, rerank, text_embedding)
Hugging Face (chat_completion, completion, rerank, text_embedding)
JinaAI (rerank, text_embedding)
Llama (chat_completion, completion, text_embedding)
Mistral (chat_completion, completion, text_embedding)
OpenAI (chat_completion, completion, text_embedding)
VoyageAI (rerank, text_embedding)
Watsonx (rerank, text_embedding)
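The list above can be treated as a lookup table when checking a task type before sending a request. The following is a minimal sketch, not part of any client library: `INTEGRATION_TASK_TYPES` and `supports` are hypothetical names, only a subset of the integrations is copied in, and the keys are the display names from this page rather than the values expected in the `service` field.

```python
# Task types per integration, copied from a subset of the list above.
# Keys are display names from this page, NOT `service` values (assumption:
# the mapping is maintained by hand and may lag behind the server).
INTEGRATION_TASK_TYPES = {
    "Anthropic": {"completion"},
    "Cohere": {"completion", "rerank", "text_embedding"},
    "ELSER": {"sparse_embedding"},
    "JinaAI": {"rerank", "text_embedding"},
    "OpenAI": {"chat_completion", "completion", "text_embedding"},
}


def supports(integration: str, task_type: str) -> bool:
    """Return True if the integration is listed with the given task type."""
    return task_type in INTEGRATION_TASK_TYPES.get(integration, set())


if __name__ == "__main__":
    print(supports("Cohere", "rerank"))
    print(supports("Anthropic", "rerank"))
```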

Required authorization

Cluster privileges: manage_inference

Path parameters

task_type string Required

The task type. Refer to the integration list in the API description for the available task types.

Values are sparse_embedding, text_embedding, rerank, completion, or chat_completion.

inference_id string Required

The unique identifier of the inference endpoint.

Query parameters

timeout string

Specifies the amount of time to wait for the inference endpoint to be created.
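As an illustration of how the path and query parameters combine, the sketch below builds the request path for both documented forms using only the Python standard library. `inference_path` is a hypothetical helper, and the `"30s"` value assumes the usual Elasticsearch time-unit notation; percent-encoding the ID is a defensive choice rather than a documented requirement.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Task type values listed under "Path parameters".
TASK_TYPES = {
    "sparse_embedding",
    "text_embedding",
    "rerank",
    "completion",
    "chat_completion",
}


def inference_path(
    inference_id: str,
    task_type: Optional[str] = None,
    timeout: Optional[str] = None,
) -> str:
    """Build /_inference/{task_type}/{inference_id} or /_inference/{inference_id}."""
    if task_type is not None and task_type not in TASK_TYPES:
        raise ValueError(f"unknown task_type: {task_type!r}")
    parts = ["_inference"]
    if task_type is not None:
        parts.append(task_type)
    # Encode the ID so it cannot introduce extra path segments.
    parts.append(quote(inference_id, safe=""))
    path = "/" + "/".join(parts)
    if timeout is not None:
        path += "?" + urlencode({"timeout": timeout})
    return path


if __name__ == "__main__":
    print(inference_path("my-rerank-model", "rerank", timeout="30s"))
```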

application/json

Body Required

chunking_settings object

Chunking configuration object
- max_chunk_size number
  
  The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
  
  Default value is 250.
- overlap number
  
  The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.
  
  Default value is 100.
- sentence_overlap number
  
  The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.
  
  Default value is 1.
- strategy string
  
  The chunking strategy: sentence or word.
  
  Default value is sentence.

service string Required

The type of service that backs the endpoint, for example cohere.

service_settings object Required

Settings specific to the service.

task_settings object

Task settings specific to the service and task type.
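The chunking_settings constraints described above can be checked on the client before the request is sent. This is a minimal sketch under stated assumptions: `validate_chunking_settings` is a hypothetical helper that mirrors only the limits and defaults listed on this page, and the server remains the authority on what it accepts.

```python
def validate_chunking_settings(settings: dict) -> dict:
    """Apply the documented defaults and limits; return the resolved settings."""
    strategy = settings.get("strategy", "sentence")
    if strategy not in ("sentence", "word"):
        raise ValueError("strategy must be 'sentence' or 'word'")

    max_chunk_size = settings.get("max_chunk_size", 250)
    lower = 20 if strategy == "sentence" else 10
    if not lower <= max_chunk_size <= 300:
        raise ValueError(f"max_chunk_size must be between {lower} and 300")

    resolved = {"strategy": strategy, "max_chunk_size": max_chunk_size}
    if strategy == "word":
        # overlap applies only to the word strategy.
        overlap = settings.get("overlap", 100)
        if overlap > max_chunk_size / 2:
            raise ValueError("overlap cannot exceed half of max_chunk_size")
        resolved["overlap"] = overlap
    else:
        # sentence_overlap applies only to the sentence strategy.
        sentence_overlap = settings.get("sentence_overlap", 1)
        if sentence_overlap not in (0, 1):
            raise ValueError("sentence_overlap must be 0 or 1")
        resolved["sentence_overlap"] = sentence_overlap
    return resolved


if __name__ == "__main__":
    print(validate_chunking_settings({"strategy": "word", "max_chunk_size": 120, "overlap": 40}))
```

Note that with the word strategy, the default overlap of 100 is only valid when max_chunk_size is at least 200, so a smaller max_chunk_size needs an explicit overlap.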

Responses

200 application/json
Represents an inference endpoint as returned by the GET API
- chunking_settings object
  
  Chunking configuration object
  
  max_chunk_size number
  
  The maximum size of a chunk in words. This value cannot be higher than 300 or lower than 20 (for sentence strategy) or 10 (for word strategy).
  
  Default value is 250.
  
  overlap number
  
  The number of overlapping words for chunks. It is applicable only to a word chunking strategy. This value cannot be higher than half the max_chunk_size value.
  
  Default value is 100.
  
  sentence_overlap number
  
  The number of overlapping sentences for chunks. It is applicable only for a sentence chunking strategy. It can be either 1 or 0.
  
  Default value is 1.
  
  strategy string
  
  The chunking strategy: sentence or word.
  
  Default value is sentence.
- service string Required
  
  The service type
- service_settings object Required
  
  Settings specific to the service
- task_settings object
  
  Task settings specific to the service and task type
- inference_id string Required
  
  The unique identifier of the inference endpoint.
- task_type string Required
  
  The task type
  
  Values are sparse_embedding, text_embedding, rerank, completion, or chat_completion.
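A caller may want to confirm that a 200 response carries the attributes marked Required above before relying on it. The sketch below assumes the response has already been decoded into a Python dict; `missing_response_fields` is a hypothetical helper, and the sample dict is illustrative rather than real server output.

```python
# Attributes marked "Required" in the response description above.
REQUIRED_RESPONSE_FIELDS = ("inference_id", "task_type", "service", "service_settings")


def missing_response_fields(response: dict) -> list:
    """Return the required attribute names that are absent from the response."""
    return [name for name in REQUIRED_RESPONSE_FIELDS if name not in response]


if __name__ == "__main__":
    # Hypothetical, hand-written response used only to exercise the helper.
    sample = {
        "inference_id": "my-rerank-model",
        "task_type": "rerank",
        "service": "cohere",
        "service_settings": {"model_id": "rerank-english-v3.0"},
    }
    print(missing_response_fields(sample))
```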

PUT _inference/rerank/my-rerank-model
{
 "service": "cohere",
 "service_settings": {
   "model_id": "rerank-english-v3.0",
   "api_key": "{{COHERE_API_KEY}}"
 }
}

resp = client.inference.put(
    task_type="rerank",
    inference_id="my-rerank-model",
    inference_config={
        "service": "cohere",
        "service_settings": {
            "model_id": "rerank-english-v3.0",
            "api_key": "{{COHERE_API_KEY}}"
        }
    },
)

const response = await client.inference.put({
  task_type: "rerank",
  inference_id: "my-rerank-model",
  inference_config: {
    service: "cohere",
    service_settings: {
      model_id: "rerank-english-v3.0",
      api_key: "{{COHERE_API_KEY}}",
    },
  },
});

response = client.inference.put(
  task_type: "rerank",
  inference_id: "my-rerank-model",
  body: {
    "service": "cohere",
    "service_settings": {
      "model_id": "rerank-english-v3.0",
      "api_key": "{{COHERE_API_KEY}}"
    }
  }
)

$resp = $client->inference()->put([
    "task_type" => "rerank",
    "inference_id" => "my-rerank-model",
    "body" => [
        "service" => "cohere",
        "service_settings" => [
            "model_id" => "rerank-english-v3.0",
            "api_key" => "{{COHERE_API_KEY}}",
        ],
    ],
]);

curl -X PUT -H "Authorization: ApiKey $ELASTIC_API_KEY" -H "Content-Type: application/json" -d '{"service":"cohere","service_settings":{"model_id":"rerank-english-v3.0","api_key":"{{COHERE_API_KEY}}"}}' "$ELASTICSEARCH_URL/_inference/rerank/my-rerank-model"

client.inference().put(p -> p
    .inferenceId("my-rerank-model")
    .taskType(TaskType.Rerank)
    .inferenceConfig(i -> i
        .service("cohere")
        .serviceSettings(JsonData.fromJson("{\"model_id\":\"rerank-english-v3.0\",\"api_key\":\"{{COHERE_API_KEY}}\"}"))
    )
);

Request example

An example body for a `PUT _inference/rerank/my-rerank-model` request.

{
 "service": "cohere",
 "service_settings": {
   "model_id": "rerank-english-v3.0",
   "api_key": "{{COHERE_API_KEY}}"
 }
}