Create inference API

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

Creates an inference endpoint to perform an inference task.

The inference APIs enable you to use certain services, such as built-in machine learning models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure OpenAI, Google AI Studio, or Hugging Face. For built-in models and models uploaded through Eland, the inference APIs offer an alternative way to use and manage trained models. However, if you do not plan to use the inference APIs with these models, or if you want to use non-NLP models, use the Machine learning trained model APIs instead.


PUT /_inference/<task_type>/<inference_id>


  • Requires the manage_inference cluster privilege (the built-in inference_admin role grants this privilege)


The create inference API enables you to create an inference endpoint and configure a machine learning model to perform a specific inference task.

The following services are available through the inference API:

  • Azure AI Studio
  • Azure OpenAI
  • Cohere
  • Elasticsearch (for built-in models and models uploaded through Eland)
  • Google AI Studio
  • Hugging Face
  • OpenAI

Path parameters

<inference_id>
(Required, string) The unique identifier of the inference endpoint.

<task_type>
(Required, string) The type of the inference task that the model will perform. Available task types:

  • completion,
  • rerank,
  • sparse_embedding,
  • text_embedding.
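As a rough illustration of how the two path parameters combine, the following Python sketch assembles the request path and rejects task types that are not in the list above. The helper name build_inference_path is hypothetical and is not part of any Elastic client.

```python
from urllib.parse import quote

# Task types listed in this section of the documentation.
TASK_TYPES = {"completion", "rerank", "sparse_embedding", "text_embedding"}


def build_inference_path(task_type: str, inference_id: str) -> str:
    """Return the path for PUT /_inference/<task_type>/<inference_id>.

    Hypothetical helper, for illustration only.
    """
    if task_type not in TASK_TYPES:
        raise ValueError(f"unsupported task type: {task_type!r}")
    if not inference_id:
        raise ValueError("inference_id must not be empty")
    # Percent-encode the identifier so it is safe inside a URL path.
    return f"/_inference/{task_type}/{quote(inference_id, safe='')}"


print(build_inference_path("text_embedding", "cohere-embeddings"))
```

Validating the task type on the client side only saves a round trip; the server remains the authority on which values it accepts.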

Request body

service
(Required, string) The type of service supported for the specified task type. Available services:

  • azureopenai: specify the completion or text_embedding task type to use the Azure OpenAI service.
  • azureaistudio: specify the completion or text_embedding task type to use the Azure AI Studio service.
  • cohere: specify the completion, text_embedding or the rerank task type to use the Cohere service.
  • elasticsearch: specify the text_embedding task type to use the E5 built-in model or text embedding models uploaded by Eland.
  • elser: specify the sparse_embedding task type to use the ELSER service.
  • googleaistudio: specify the completion task to use the Google AI Studio service.
  • hugging_face: specify the text_embedding task type to use the Hugging Face service.
  • openai: specify the completion or text_embedding task type to use the OpenAI service.
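The service and task type pairings above can be captured in a small lookup table. The sketch below is only a client-side convenience transcribed from that list; the names SUPPORTED_TASK_TYPES and is_supported are hypothetical.

```python
# Service -> supported task types, transcribed from the list above.
SUPPORTED_TASK_TYPES = {
    "azureopenai": {"completion", "text_embedding"},
    "azureaistudio": {"completion", "text_embedding"},
    "cohere": {"completion", "text_embedding", "rerank"},
    "elasticsearch": {"text_embedding"},
    "elser": {"sparse_embedding"},
    "googleaistudio": {"completion"},
    "hugging_face": {"text_embedding"},
    "openai": {"completion", "text_embedding"},
}


def is_supported(service: str, task_type: str) -> bool:
    """Return True if the documented list pairs this service with this task type."""
    return task_type in SUPPORTED_TASK_TYPES.get(service, set())


print(is_supported("cohere", "rerank"))
print(is_supported("elser", "text_embedding"))
```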

service_settings
(Required, object) Settings used to install the inference model. These settings are specific to the service you specified.

service_settings for the azureaistudio service

api_key
(Required, string) A valid API key of your Azure AI Studio model deployment. This key can be found on the overview page for your deployment in the management section of your Azure AI Studio account.

You need to provide the API key only once, during the inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.

target
(Required, string) The target URL of your Azure AI Studio model deployment. This can be found on the overview page for your deployment in the management section of your Azure AI Studio account.

provider
(Required, string) The model provider for your deployment. Note that some providers may support only certain task types. Supported providers include:

  • cohere - available for text_embedding and completion task types
  • databricks - available for completion task type only
  • meta - available for completion task type only
  • microsoft_phi - available for completion task type only
  • mistral - available for completion task type only
  • openai - available for text_embedding and completion task types

endpoint_type
(Required, string) One of token or realtime. Specifies the type of endpoint that is used in your model deployment. There are two endpoint types available for deployment through Azure AI Studio. "Pay as you go" endpoints are billed per token; for these, specify token as the endpoint_type. "Real-time" endpoints are billed per hour of usage; for these, specify realtime.

rate_limit
(Optional, object) By default, the azureaistudio service sets the number of requests allowed per minute to 240. This helps to minimize the number of rate limit errors returned from Azure AI Studio. To modify this, set the requests_per_minute setting of this object in your service settings:

"rate_limit": {
    "requests_per_minute": <<number_of_requests>>
}
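The following Python sketch shows one way to assemble the azureaistudio service settings while checking the provider, endpoint type, and optional rate limit described above. The function name azure_ai_studio_settings is hypothetical, and the field names api_key, target, provider, and endpoint_type are taken from the Azure AI Studio examples later on this page.

```python
# Provider -> task types, transcribed from the provider list above.
PROVIDER_TASK_TYPES = {
    "cohere": {"text_embedding", "completion"},
    "databricks": {"completion"},
    "meta": {"completion"},
    "microsoft_phi": {"completion"},
    "mistral": {"completion"},
    "openai": {"text_embedding", "completion"},
}
ENDPOINT_TYPES = {"token", "realtime"}


def azure_ai_studio_settings(api_key, target, provider, endpoint_type,
                             task_type, requests_per_minute=None):
    """Build a service_settings dict for the azureaistudio service.

    Hypothetical helper; it only mirrors the rules documented above.
    """
    if task_type not in PROVIDER_TASK_TYPES.get(provider, set()):
        raise ValueError(f"provider {provider!r} does not support {task_type!r}")
    if endpoint_type not in ENDPOINT_TYPES:
        raise ValueError("endpoint_type must be 'token' or 'realtime'")
    settings = {
        "api_key": api_key,
        "target": target,
        "provider": provider,
        "endpoint_type": endpoint_type,
    }
    # Only include rate_limit when overriding the service default.
    if requests_per_minute is not None:
        settings["rate_limit"] = {"requests_per_minute": requests_per_minute}
    return settings


settings = azure_ai_studio_settings(
    "<api_key>", "<target_uri>", "openai", "token",
    task_type="text_embedding", requests_per_minute=120,
)
print(settings["rate_limit"])
```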
service_settings for the azureopenai service

api_key or entra_id
(Required, string) You must provide either an API key or an Entra ID. If you provide neither, or provide both, you will receive an error when trying to create your model. See the Azure OpenAI Authentication documentation for more details on these authentication types.

You need to provide the API key or Entra ID only once, during the inference model creation. The Get inference API does not retrieve your authentication credentials. After creating the inference model, you cannot change the associated API key or Entra ID. If you want to use a different API key or Entra ID, delete the inference model and recreate it with the same name and the updated credential.

resource_name
(Required, string) The name of your Azure OpenAI resource. You can find this in the list of resources in the Azure Portal for your subscription.

deployment_id
(Required, string) The deployment name of your deployed models. Your Azure OpenAI deployments can be found through the Azure OpenAI Studio portal that is linked to your subscription.

api_version
(Required, string) The Azure API version ID to use. We recommend using the latest supported non-preview version.
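Because exactly one of api_key or entra_id must be present, it can help to enforce that rule before sending the request. The sketch below is a minimal client-side check; the function name azure_openai_settings is hypothetical, and the field names resource_name, deployment_id, and api_version are taken from the Azure OpenAI examples later on this page.

```python
def azure_openai_settings(resource_name, deployment_id, api_version,
                          api_key=None, entra_id=None):
    """Build a service_settings dict for the azureopenai service.

    Hypothetical helper that rejects requests carrying neither
    credential or both credentials.
    """
    # True when both are missing or both are set.
    if (api_key is None) == (entra_id is None):
        raise ValueError("provide exactly one of api_key or entra_id")
    settings = {
        "resource_name": resource_name,
        "deployment_id": deployment_id,
        "api_version": api_version,
    }
    if api_key is not None:
        settings["api_key"] = api_key
    else:
        settings["entra_id"] = entra_id
    return settings


print(sorted(azure_openai_settings(
    "<resource_name>", "<deployment_id>", "2024-02-01", api_key="<api_key>",
)))
```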
service_settings for the cohere service

api_key
(Required, string) A valid API key of your Cohere account. You can find your Cohere API keys or create a new one on the API keys settings page.

You need to provide the API key only once, during the inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.

embedding_type
(Optional, string) Only for text_embedding. Specifies the types of embeddings you want to get back. Defaults to float. Valid values are:

  • byte: use it for signed int8 embeddings (this is a synonym of int8).
  • float: use it for the default float embeddings.
  • int8: use it for signed int8 embeddings.

model_id
(Optional, string) The name of the model to use for the inference task. To review the available rerank models, refer to the Cohere docs.

To review the available text_embedding models, refer to the Cohere docs. The default value for text_embedding is embed-english-v2.0.

service_settings for the elasticsearch service

model_id
(Required, string) The name of the model to use for the inference task. It can be the ID of either a built-in model (for example, .multilingual-e5-small for E5) or a text embedding model already uploaded through Eland.

num_allocations
(Required, integer) The number of model allocations to create. num_allocations must not exceed the number of available processors per node divided by num_threads.

num_threads
(Required, integer) The number of threads used by each model allocation. num_threads must not exceed the number of available processors per node divided by the number of allocations. Must be a power of 2. The maximum allowed value is 32.

service_settings for the elser service

num_allocations
(Required, integer) The number of model allocations to create. num_allocations must not exceed the number of available processors per node divided by num_threads.

num_threads
(Required, integer) The number of threads used by each model allocation. num_threads must not exceed the number of available processors per node divided by the number of allocations. Must be a power of 2. The maximum allowed value is 32.
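The allocation and thread constraints above amount to a few arithmetic checks. The Python sketch below expresses them as a pre-flight check; the function name check_allocation is hypothetical, and processors_per_node is a value you would have to supply yourself because this page does not say how to obtain it.

```python
def check_allocation(num_allocations, num_threads, processors_per_node):
    """Return a list of problems with the given settings (empty if none).

    Hypothetical helper that mirrors the documented constraints.
    """
    problems = []
    # A positive integer is a power of 2 when it has exactly one bit set.
    if num_threads < 1 or (num_threads & (num_threads - 1)) != 0:
        problems.append("num_threads must be a power of 2")
    if num_threads > 32:
        problems.append("num_threads must not exceed 32")
    if num_allocations < 1:
        problems.append("num_allocations must be at least 1")
    # num_allocations <= processors / num_threads, rewritten without division.
    if num_allocations * num_threads > processors_per_node:
        problems.append("num_allocations * num_threads exceeds processors per node")
    return problems


print(check_allocation(1, 1, processors_per_node=4))
print(check_allocation(4, 4, processors_per_node=8))
```

Both processor rules in the text reduce to the same inequality, num_allocations * num_threads <= processors per node, which is why the sketch checks it once.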
service_settings for the googleaistudio service

api_key
(Required, string) A valid API key for the Google Gemini API.

model_id
(Required, string) The name of the model to use for the inference task. You can find the supported models at Gemini API models.

rate_limit
(Optional, object) By default, the googleaistudio service sets the number of requests allowed per minute to 360. This helps to minimize the number of rate limit errors returned from Google AI Studio. To modify this, set the requests_per_minute setting of this object in your service settings:

"rate_limit": {
    "requests_per_minute": <<number_of_requests>>
}
service_settings for the hugging_face service

api_key
(Required, string) A valid access token of your Hugging Face account. You can find your Hugging Face access tokens or create a new one on the settings page.

You need to provide the API key only once, during the inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.

url
(Required, string) The URL endpoint to use for the requests.
service_settings for the openai service

api_key
(Required, string) A valid API key of your OpenAI account. You can find your OpenAI API keys in your OpenAI account under the API keys section.

You need to provide the API key only once, during the inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.

model_id
(Required, string) The name of the model to use for the inference task. Refer to the OpenAI documentation for the list of available text embedding models.

organization_id
(Optional, string) The unique identifier of your organization. You can find the Organization ID in your OpenAI account under Settings > Organizations.

url
(Optional, string) The URL endpoint to use for the requests. Can be changed for testing purposes. Defaults to

task_settings
(Optional, object) Settings to configure the inference task. These settings are specific to the <task_type> you specified.

task_settings for the completion task type

do_sample
(Optional, float) For the azureaistudio service only. Instructs the inference process whether to perform sampling. Has no effect unless temperature or top_p is specified.

max_new_tokens
(Optional, integer) For the azureaistudio service only. Provides a hint for the maximum number of output tokens to be generated. Defaults to 64.

user
(Optional, string) For the openai service only. Specifies the user issuing the request, which can be used for abuse detection.

temperature
(Optional, float) For the azureaistudio service only. A number in the range of 0.0 to 2.0 that specifies the sampling temperature, which controls the apparent creativity of generated completions. Should not be used if top_p is specified.

top_p
(Optional, float) For the azureaistudio service only. A number in the range of 0.0 to 2.0 that is an alternative to temperature and causes the model to consider the results of the tokens with nucleus sampling probability. Should not be used if temperature is specified.
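The range and either-or guidance for temperature and top_p can be checked before a request is built. The sketch below is one possible client-side check; the function name check_completion_settings is hypothetical, and treating the "should not be used together" guidance as a reportable problem is a choice made for this example rather than documented server behavior.

```python
def check_completion_settings(task_settings):
    """Return a list of problems found in completion task settings.

    Hypothetical helper based only on the descriptions above.
    """
    problems = []
    temperature = task_settings.get("temperature")
    top_p = task_settings.get("top_p")
    # The text advises using one of the two settings, not both.
    if temperature is not None and top_p is not None:
        problems.append("set either temperature or top_p, not both")
    for name, value in (("temperature", temperature), ("top_p", top_p)):
        if value is not None and not 0.0 <= value <= 2.0:
            problems.append(f"{name} must be between 0.0 and 2.0")
    return problems


print(check_completion_settings({"temperature": 0.7}))
print(check_completion_settings({"temperature": 0.7, "top_p": 2.5}))
```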
task_settings for the rerank task type

return_documents
(Optional, boolean) For the cohere service only. Specifies whether to return the document text within the results.

top_n
(Optional, integer) The number of most relevant documents to return. Defaults to the number of documents.
task_settings for the text_embedding task type

input_type
(Optional, string) For the cohere service only. Specifies the type of input passed to the model. Valid values are:

  • classification: use it for embeddings passed through a text classifier.
  • clustering: use it for embeddings run through a clustering algorithm.
  • ingest: use it for storing document embeddings in a vector database.
  • search: use it for storing embeddings of search queries run against a vector database to find relevant documents.

truncate
(Optional, string) For the cohere service only. Specifies how the API handles inputs longer than the maximum token length. Defaults to END. Valid values are:

  • NONE: when the input exceeds the maximum input token length, an error is returned.
  • START: when the input exceeds the maximum input token length, the start of the input is discarded.
  • END: when the input exceeds the maximum input token length, the end of the input is discarded.

user
(Optional, string) For the openai, azureopenai, and azureaistudio services only. Specifies the user issuing the request, which can be used for abuse detection.
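To make the NONE, START, and END truncation values described above concrete, the following Python sketch applies the same rules to a plain list of tokens. It is a local illustration of the documented semantics only: the real truncation happens on the service side, and the function name truncate_tokens is hypothetical.

```python
def truncate_tokens(tokens, max_tokens, mode="END"):
    """Apply the documented truncation rule to a list of tokens.

    Hypothetical helper; END is the documented default.
    """
    if len(tokens) <= max_tokens:
        return list(tokens)
    if mode == "NONE":
        raise ValueError("input exceeds the maximum input token length")
    if mode == "START":
        # Discard the start of the input, keep the last max_tokens tokens.
        return list(tokens[len(tokens) - max_tokens:])
    if mode == "END":
        # Discard the end of the input, keep the first max_tokens tokens.
        return list(tokens[:max_tokens])
    raise ValueError(f"unknown truncation mode: {mode!r}")


tokens = ["a", "b", "c", "d", "e"]
print(truncate_tokens(tokens, 3, "START"))
print(truncate_tokens(tokens, 3, "END"))
```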


This section contains example API calls for every service type.

Azure AI Studio service

The following example shows how to create an inference endpoint called azure_ai_studio_embeddings to perform a text_embedding task type. Note that we do not specify a model here, as it is defined already via our Azure AI Studio deployment.

The list of embeddings models that you can choose from in your deployment can be found in the Azure AI Studio model explorer.

PUT _inference/text_embedding/azure_ai_studio_embeddings
{
    "service": "azureaistudio",
    "service_settings": {
        "api_key": "<api_key>",
        "target": "<target_uri>",
        "provider": "<model_provider>",
        "endpoint_type": "<endpoint_type>"
    }
}

The next example shows how to create an inference endpoint called azure_ai_studio_completion to perform a completion task type.

PUT _inference/completion/azure_ai_studio_completion
{
    "service": "azureaistudio",
    "service_settings": {
        "api_key": "<api_key>",
        "target": "<target_uri>",
        "provider": "<model_provider>",
        "endpoint_type": "<endpoint_type>"
    }
}

The list of chat completion models that you can choose from in your deployment can be found in the Azure AI Studio model explorer.

Azure OpenAI service

The following example shows how to create an inference endpoint called azure_openai_embeddings to perform a text_embedding task type. Note that we do not specify a model here, as it is defined already via our Azure OpenAI deployment.

The list of embeddings models that you can choose from in your deployment can be found in the Azure models documentation.

# Assumes `client` is an already configured elasticsearch.Elasticsearch instance.
resp = client.inference.put_model(
    task_type="text_embedding",
    inference_id="azure_openai_embeddings",
    body={
        "service": "azureopenai",
        "service_settings": {
            "api_key": "<api_key>",
            "resource_name": "<resource_name>",
            "deployment_id": "<deployment_id>",
            "api_version": "2024-02-01",
        },
    },
)

PUT _inference/text_embedding/azure_openai_embeddings
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "<api_key>",
        "resource_name": "<resource_name>",
        "deployment_id": "<deployment_id>",
        "api_version": "2024-02-01"
    }
}

The next example shows how to create an inference endpoint called azure_openai_completion to perform a completion task type.

PUT _inference/completion/azure_openai_completion
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "<api_key>",
        "resource_name": "<resource_name>",
        "deployment_id": "<deployment_id>",
        "api_version": "2024-02-01"
    }
}

The list of chat completion models that you can choose from in your Azure OpenAI deployment can be found at the following places:

Cohere service

The following example shows how to create an inference endpoint called cohere-embeddings to perform a text_embedding task type.

resp = client.inference.put_model(
    task_type="text_embedding",
    inference_id="cohere-embeddings",
    body={
        "service": "cohere",
        "service_settings": {
            "api_key": "<api_key>",
            "model_id": "embed-english-light-v3.0",
            "embedding_type": "byte",
        },
    },
)

PUT _inference/text_embedding/cohere-embeddings
{
    "service": "cohere",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "embed-english-light-v3.0",
        "embedding_type": "byte"
    }
}

The following example shows how to create an inference endpoint called cohere-rerank to perform a rerank task type.

resp = client.inference.put_model(
    task_type="rerank",
    inference_id="cohere-rerank",
    body={
        "service": "cohere",
        "service_settings": {
            "api_key": "<API-KEY>",
            "model_id": "rerank-english-v3.0",
        },
        "task_settings": {"top_n": 10, "return_documents": True},
    },
)

PUT _inference/rerank/cohere-rerank
{
    "service": "cohere",
    "service_settings": {
        "api_key": "<API-KEY>",
        "model_id": "rerank-english-v3.0"
    },
    "task_settings": {
        "top_n": 10,
        "return_documents": true
    }
}

For more examples, also review the Cohere documentation.

E5 via the elasticsearch service

The following example shows how to create an inference endpoint called my-e5-model to perform a text_embedding task type.

resp = client.inference.put_model(
    task_type="text_embedding",
    inference_id="my-e5-model",
    body={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": ".multilingual-e5-small",
        },
    },
)

PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}

The model_id must be the ID of one of the built-in E5 models. Valid values are .multilingual-e5-small and .multilingual-e5-small_linux-x86_64. For further details, refer to the E5 model documentation.

ELSER service

The following example shows how to create an inference endpoint called my-elser-model to perform a sparse_embedding task type. Refer to the ELSER model documentation for more info.

resp = client.inference.put_model(
    task_type="sparse_embedding",
    inference_id="my-elser-model",
    body={
        "service": "elser",
        "service_settings": {"num_allocations": 1, "num_threads": 1},
    },
)

PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

Example response:

{
  "inference_id": "my-elser-model",
  "task_type": "sparse_embedding",
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}

Google AI Studio service

The following example shows how to create an inference endpoint called google_ai_studio_completion to perform a completion task type.

PUT _inference/completion/google_ai_studio_completion
{
    "service": "googleaistudio",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "<model_id>"
    }
}

Hugging Face service

The following example shows how to create an inference endpoint called hugging-face-embeddings to perform a text_embedding task type.

resp = client.inference.put_model(
    task_type="text_embedding",
    inference_id="hugging-face-embeddings",
    body={
        "service": "hugging_face",
        "service_settings": {
            "api_key": "<access_token>",
            "url": "<url_endpoint>",
        },
    },
)

PUT _inference/text_embedding/hugging-face-embeddings
{
  "service": "hugging_face",
  "service_settings": {
    "api_key": "<access_token>",
    "url": "<url_endpoint>"
  }
}

api_key: A valid Hugging Face access token. You can find it on the settings page of your account.

url: The inference endpoint URL you created on Hugging Face.

Create a new inference endpoint on the Hugging Face endpoint page to get an endpoint URL. Select the model you want to use on the new endpoint creation page - for example intfloat/e5-small-v2 - then select the Sentence Embeddings task under the Advanced configuration section. Create the endpoint. Copy the URL after the endpoint initialization has been finished.

The list of recommended models for the Hugging Face service:

Models uploaded by Eland via the elasticsearch service

The following example shows how to create an inference endpoint called my-msmarco-minilm-model to perform a text_embedding task type.

resp = client.inference.put_model(
    task_type="text_embedding",
    inference_id="my-msmarco-minilm-model",
    body={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": "msmarco-MiniLM-L12-cos-v5",
        },
    },
)

PUT _inference/text_embedding/my-msmarco-minilm-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": "msmarco-MiniLM-L12-cos-v5"
  }
}

The model_id must be the ID of a text embedding model which has already been uploaded through Eland.

OpenAI service

The following example shows how to create an inference endpoint called openai-embeddings to perform a text_embedding task type.

resp = client.inference.put_model(
    task_type="text_embedding",
    inference_id="openai-embeddings",
    body={
        "service": "openai",
        "service_settings": {
            "api_key": "<api_key>",
            "model_id": "text-embedding-ada-002",
        },
    },
)

PUT _inference/text_embedding/openai-embeddings
{
    "service": "openai",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "text-embedding-ada-002"
    }
}

The next example shows how to create an inference endpoint called openai-completion to perform a completion task type.

resp = client.inference.put_model(
    task_type="completion",
    inference_id="openai-completion",
    body={
        "service": "openai",
        "service_settings": {
            "api_key": "<api_key>",
            "model_id": "gpt-3.5-turbo",
        },
    },
)

PUT _inference/completion/openai-completion
{
    "service": "openai",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "gpt-3.5-turbo"
    }
}