Perform chat completion inference on the service

Generally available; Added in 8.18.0

POST /_inference/chat_completion/{inference_id}/_stream

The chat completion inference API enables real-time responses for chat completion tasks by delivering answers incrementally as they are generated, reducing the time to first output. It only works with the chat_completion task type.

NOTE: The chat_completion task type is only available within the _stream API and only supports streaming. The Chat completion inference API and the Stream inference API differ in their response structure and capabilities. The Chat completion inference API provides more comprehensive customization options through more fields and function calling support. To determine whether a given inference service supports this task type, please see the page for that service.
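Because the response arrives as server-sent events, a client has to split the stream on `data:` lines and stop at the `[DONE]` sentinel. A minimal sketch of that parsing, using only the standard library (the field names are taken from the response examples later on this page; the endpoint URL and authentication are left out):

```python
import json

def parse_sse_line(line: str):
    """Return the decoded payload of a `data:` line, or None for other lines.

    The terminal `[DONE]` sentinel is returned as the string "[DONE]".
    """
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return "[DONE]"
    return json.loads(payload)

def content_deltas(lines):
    """Yield the content fragments from a sequence of SSE lines."""
    for line in lines:
        event = parse_sse_line(line)
        if event is None or event == "[DONE]":
            continue
        for choice in event["chat_completion"]["choices"]:
            delta = choice.get("delta", {})
            if delta.get("content"):
                yield delta["content"]
```

With an HTTP client that exposes the response line by line (for example `requests` with `stream=True` and `iter_lines`), feeding those lines through `content_deltas` and joining the fragments reconstructs the full answer.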

Path parameters

  • inference_id string Required

    The unique identifier of the inference endpoint.

Query parameters

Body Required (application/json)

  • messages array[object] Required

    A list of objects representing the conversation. Requests should generally only add new messages from the user (role user). The other message roles (assistant, system, or tool) should generally only be copied from the response to a previous completion request, such that the messages array is built up throughout a conversation.

    An object representing part of the conversation.

    • content string | array[object]

      The content of the message.

      String example:

      {
         "content": "Some string"
      }
      

      Text example:

      {
        "content": [
            {
             "text": "Some text",
             "type": "text"
            }
         ]
      }
      

      Image example:

      {
        "content": [
            {
             "image_url": {
               "url": "data:image/jpg;base64,..."
             },
             "type": "image_url"
            }
         ]
      }
      

      File example:

      {
        "content": [
            {
             "file": {
               "file_data": "data:application/pdf;base64,...",
               "filename": "somePDF"
             },
             "type": "file"
            }
         ]
      }
      
    • role string Required

      The role of the message author. Valid values are user, assistant, system, and tool.

    • tool_call_id string

      Only for tool role messages. The tool call that this message is responding to.

    • tool_calls array[object]

      Only for assistant role messages. The tool calls generated by the model. If it's specified, the content field is optional. Example:

      {
        "tool_calls": [
            {
                "id": "call_KcAjWtAww20AihPHphUh46Gd",
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": "{\"location\":\"Boston, MA\"}"
                }
            }
        ]
      }
      

      A tool call generated by the model.

      • id string Required

        The identifier of the tool call.

      • function object Required

        The function that the model called.

        • arguments string Required

          The arguments to call the function with in JSON format.

        • name string Required

          The name of the function to call.

      • type string Required

        The type of the tool call.

    • reasoning string

      Only for assistant role messages. The reasoning details generated by the model as plaintext. Currently supported only for elastic provider.

    • reasoning_details array[object]

      Only for assistant role messages. The reasoning details generated by the model as structured data. Currently supported only for elastic provider.

      Each entry is one of the reasoning detail types that can be returned by the model: a reasoning.text, reasoning.summary, or reasoning.encrypted object (see the request examples below for the shape of each).
  • model string

    The ID of the model to use. By default, the model ID is set to the value included when creating the inference endpoint.

  • max_completion_tokens number

    The upper bound limit for the number of tokens that can be generated for a completion request.

  • reasoning object

    The reasoning configuration for the completion request. This controls the model's reasoning process in one of three ways:

    • By specifying the model’s reasoning effort level with the effort field.
    • By setting a maximum number of reasoning tokens with the max_tokens field.
    • By enabling reasoning with default settings by setting enabled field to true.

    It also includes optional settings to control:

    • The level of detail in the summary returned in the response with the summary field.
    • Whether reasoning details are included in the response at all with the exclude field.

    Example (effort):

    {
       "reasoning": {
           "effort": "high",
           "summary": "concise",
           "exclude": false
       }
    }
    
    

    Example (max_tokens):

    {
       "reasoning": {
           "max_tokens": 100,
           "summary": "concise",
           "exclude": false
       }
    }
    

    Example (enabled):

    {
       "reasoning": {
           "enabled": true,
           "summary": "concise",
           "exclude": false
       }
    }
    

    Currently supported only for elastic provider.

    • effort string

      The level of effort the model should put into reasoning. This is a hint that guides the model in how much effort to put into reasoning, with xhigh being the most effort and none being no effort. It cannot be used together with max_tokens.

      Values are xhigh, high, medium, low, minimal, or none.

    • enabled boolean

      Whether to enable reasoning with default settings. This is a shortcut for enabling reasoning without having to specify the other parameters. If enabled is set to true, then reasoning at the medium effort level is enabled. Ignored if either effort or max_tokens is specified, in which case those parameters will control the reasoning process instead.

    • exclude boolean

      Whether to exclude reasoning information from the response. If true, the response will not include any reasoning details.

    • max_tokens number

      The maximum number of tokens the model can use for reasoning. It cannot be used together with effort.

    • summary string

      The level of detail included in the reasoning summary returned in the response. This is a hint on how much detail to include in the summary of the reasoning that is returned in the response, with auto being the default level of detail, concise being less detail, and detailed being more detail.

      Values are auto, concise, or detailed.

  • stop array[string]

    A sequence of strings to control when the model should stop generating additional tokens.

  • temperature number

    The sampling temperature to use.

  • tool_choice string | object

    Controls which tool is called by the model. String representation: one of auto, none, or required. auto allows the model to choose between calling tools and generating a message; none prevents the model from calling any tools; required forces the model to call one or more tools. Example (object representation):

    {
      "tool_choice": {
          "type": "function",
          "function": {
              "name": "get_current_weather"
          }
      }
    }
    
  • tools array[object]

    A list of tools that the model can call. Example:

    {
      "tools": [
          {
              "type": "function",
              "function": {
                  "name": "get_price_of_item",
                  "description": "Get the current price of an item",
                  "parameters": {
                      "type": "object",
                      "properties": {
                          "item": {
                              "id": "12345"
                          },
                          "unit": {
                              "type": "currency"
                          }
                      }
                  }
              }
          }
      ]
    }
    

    A list of tools that the model can call.

    • type string Required

      The type of tool.

    • function object Required

      The function definition.

      • description string

        A description of what the function does. This is used by the model to choose when and how to call the function.

      • name string Required

        The name of the function.

      • parameters object

        The parameters the function accepts. This should be formatted as a JSON object.

      • strict boolean

        Whether to enable schema adherence when generating the function call.

  • top_p number

    Nucleus sampling, an alternative to sampling with temperature.
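Putting the body fields above together, the following sketch assembles a request body as plain Python dicts (no client library; the helper name `chat_request` is illustrative, not part of any API):

```python
def chat_request(user_text, tools=None, tool_choice=None, **options):
    """Build a chat_completion request body from a single user message.

    `options` carries optional top-level fields such as model,
    temperature, top_p, stop, or max_completion_tokens.
    """
    body = {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": user_text}],
            }
        ]
    }
    if tools:
        body["tools"] = tools
    if tool_choice:
        body["tool_choice"] = tool_choice
    body.update(options)
    return body
```

The resulting dict can be passed as the request body to `POST /_inference/chat_completion/{inference_id}/_stream`, or as the `chat_completion_request` argument of a client method.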

Responses

  • 200 application/json
Console:

POST _inference/chat_completion/openai-completion/_stream
{
  "model": "gpt-4o",
  "messages": [
      {
          "role": "user",
          "content": "What is Elastic?"
      }
  ]
}

Python:

resp = client.inference.chat_completion_unified(
    inference_id="openai-completion",
    chat_completion_request={
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": "What is Elastic?"
            }
        ]
    },
)

JavaScript:

const response = await client.inference.chatCompletionUnified({
  inference_id: "openai-completion",
  chat_completion_request: {
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: "What is Elastic?",
      },
    ],
  },
});

Ruby:

response = client.inference.chat_completion_unified(
  inference_id: "openai-completion",
  body: {
    "model": "gpt-4o",
    "messages": [
      {
        "role": "user",
        "content": "What is Elastic?"
      }
    ]
  }
)

PHP:

$resp = $client->inference()->chatCompletionUnified([
    "inference_id" => "openai-completion",
    "body" => [
        "model" => "gpt-4o",
        "messages" => array(
            [
                "role" => "user",
                "content" => "What is Elastic?",
            ],
        ),
    ],
]);

curl:

curl -X POST -H "Authorization: ApiKey $ELASTIC_API_KEY" -H "Content-Type: application/json" -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What is Elastic?"}]}' "$ELASTICSEARCH_URL/_inference/chat_completion/openai-completion/_stream"

Java:

client.inference().chatCompletionUnified(c -> c
    .inferenceId("openai-completion")
    .chatCompletionRequest(ch -> ch
        .messages(m -> m
            .content(co -> co
                .string("What is Elastic?")
            )
            .role("user")
        )
        .model("gpt-4o")
    )
);
Request examples
Run `POST _inference/chat_completion/openai-completion/_stream` to perform a chat completion on the example question with streaming.
{
  "model": "gpt-4o",
  "messages": [
      {
          "role": "user",
          "content": "What is Elastic?"
      }
  ]
}
Run `POST _inference/chat_completion/openai-completion/_stream` to perform a chat completion using an Assistant message with `tool_calls`.
{
  "messages": [
      {
          "role": "assistant",
          "content": "Let's find out what the weather is",
          "tool_calls": [ 
              {
                  "id": "call_KcAjWtAww20AihPHphUh46Gd",
                  "type": "function",
                  "function": {
                      "name": "get_current_weather",
                      "arguments": "{\"location\":\"Boston, MA\"}"
                  }
              }
          ]
      },
      { 
          "role": "tool",
          "content": "The weather is cold",
          "tool_call_id": "call_KcAjWtAww20AihPHphUh46Gd"
      }
  ]
}
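The example above shows the tool-call round trip: the assistant message carrying `tool_calls` is copied back verbatim, followed by a `tool` message whose `tool_call_id` matches the call it answers. A minimal sketch of building those follow-up messages, where `run_tool` stands in for whatever local function actually executes the call (an assumption here, not part of the API):

```python
import json

def answer_tool_calls(assistant_message, run_tool):
    """Return the messages to append after an assistant tool-call message.

    The assistant message is kept as-is; each tool call gets a matching
    tool-role reply whose tool_call_id echoes the call's id.
    """
    messages = [assistant_message]
    for call in assistant_message.get("tool_calls", []):
        args = json.loads(call["function"]["arguments"])
        result = run_tool(call["function"]["name"], args)
        messages.append({
            "role": "tool",
            "content": result,
            "tool_call_id": call["id"],
        })
    return messages
```

The returned list slots directly into the `messages` array of the next request, exactly as in the example payload above.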
Run `POST _inference/chat_completion/openai-completion/_stream` to perform a chat completion using a User message with `tools` and `tool_choice`.
{
  "messages": [
      {
          "role": "user",
          "content": [
              {
                  "type": "text",
                  "text": "What's the price of a scarf?"
              }
          ]
      }
  ],
  "tools": [
      {
          "type": "function",
          "function": {
              "name": "get_current_price",
              "description": "Get the current price of a item",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "item": {
                          "id": "123"
                      }
                  }
              }
          }
      }
  ],
  "tool_choice": {
      "type": "function",
      "function": {
          "name": "get_current_price"
      }
  }
}
Run `POST _inference/chat_completion/reasoning-chat-completion/_stream` to perform a chat completion task, using both an `effort`-based reasoning configuration and the reasoning generated by the model in a previous step.
{
  "messages": [{
      "role": "user",
      "content": [{
          "type": "text",
          "text": "Barber shaves all those, who do not shave themselves. Who shaves the barber?"
        }
      ]
    }, {
      "role": "assistant",
      "content": [{
          "type": "text",
          "text": "This is the barber paradox. Such a barber cannot logically exist."
        }
      ],
      "reasoning": "If the barber shaves himself, he should not; if he does not, he should.",
      "reasoning_details": [{
          "type": "reasoning.encrypted",
          "data": "[REDACTED]"
        }, {
          "type": "reasoning.summary",
          "summary": "Barber shaving himself creates contradiction"
        }, {
          "type": "reasoning.text",
          "text": "If the barber shaves himself, he should not; if he does not, he should.",
          "signature": "sig_123"
        }
      ]
    }, {
      "role": "user",
      "content": [{
          "type": "text",
          "text": "What if there are 2 barbers?"
        }
      ]
    }
  ],
  "reasoning": {
    "effort": "high",
    "summary": "detailed",
    "exclude": false
  }
}
Run `POST _inference/chat_completion/reasoning-chat-completion/_stream` to perform a chat completion task, using both a `max_tokens`-based reasoning configuration and the reasoning generated by the model in a previous step.
{
  "messages": [{
      "role": "user",
      "content": [{
          "type": "text",
          "text": "Barber shaves all those, who do not shave themselves. Who shaves the barber?"
        }
      ]
    }, {
      "role": "assistant",
      "content": [{
          "type": "text",
          "text": "This is the barber paradox. Such a barber cannot logically exist."
        }
      ],
      "reasoning": "If the barber shaves himself, he should not; if he does not, he should.",
      "reasoning_details": [{
          "type": "reasoning.encrypted",
          "data": "[REDACTED]"
        }, {
          "type": "reasoning.summary",
          "summary": "Barber shaving himself creates contradiction"
        }, {
          "type": "reasoning.text",
          "text": "If the barber shaves himself, he should not; if he does not, he should.",
          "signature": "sig_123"
        }
      ]
    }, {
      "role": "user",
      "content": [{
          "type": "text",
          "text": "What if there are 2 barbers?"
        }
      ]
    }
  ],
  "reasoning": {
    "max_tokens": 100,
    "summary": "detailed",
    "exclude": false
  }
}
Run `POST _inference/chat_completion/reasoning-chat-completion/_stream` to perform a chat completion task, using both an `enabled`-based reasoning configuration and the reasoning generated by the model in a previous step.
{
  "messages": [{
      "role": "user",
      "content": [{
          "type": "text",
          "text": "Barber shaves all those, who do not shave themselves. Who shaves the barber?"
        }
      ]
    }, {
      "role": "assistant",
      "content": [{
          "type": "text",
          "text": "This is the barber paradox. Such a barber cannot logically exist."
        }
      ],
      "reasoning": "If the barber shaves himself, he should not; if he does not, he should.",
      "reasoning_details": [{
          "type": "reasoning.encrypted",
          "data": "[REDACTED]"
        }, {
          "type": "reasoning.summary",
          "summary": "Barber shaving himself creates contradiction"
        }, {
          "type": "reasoning.text",
          "text": "If the barber shaves himself, he should not; if he does not, he should.",
          "signature": "sig_123"
        }
      ]
    }, {
      "role": "user",
      "content": [{
          "type": "text",
          "text": "What if there are 2 barbers?"
        }
      ]
    }
  ],
  "reasoning": {
    "enabled": true,
    "summary": "detailed",
    "exclude": false
  }
}
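The three request shapes above differ only in the `reasoning` object, and its fields are mutually constrained: `effort` and `max_tokens` cannot be combined, and `enabled` is ignored when either of them is set. A small helper (hypothetical, not part of any Elastic client) can enforce those documented constraints before sending:

```python
def build_reasoning(effort=None, max_tokens=None, enabled=None,
                    summary=None, exclude=None):
    """Build a `reasoning` object, enforcing the documented constraints."""
    if effort is not None and max_tokens is not None:
        raise ValueError("effort and max_tokens cannot be used together")
    if effort is not None and effort not in (
            "xhigh", "high", "medium", "low", "minimal", "none"):
        raise ValueError(f"invalid effort level: {effort!r}")
    if summary is not None and summary not in ("auto", "concise", "detailed"):
        raise ValueError(f"invalid summary level: {summary!r}")
    reasoning = {}
    if effort is not None:
        reasoning["effort"] = effort
    if max_tokens is not None:
        reasoning["max_tokens"] = max_tokens
    if enabled is not None and not reasoning:
        # enabled would be ignored if effort or max_tokens were set
        reasoning["enabled"] = enabled
    if summary is not None:
        reasoning["summary"] = summary
    if exclude is not None:
        reasoning["exclude"] = exclude
    return reasoning
```

For example, `build_reasoning(effort="high", summary="detailed", exclude=False)` reproduces the `reasoning` object of the first example above.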
Response examples (200)
A successful response when performing a chat completion task on the example question with streaming.
event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

(...)

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}}} 

event: message
data: [DONE]
A successful response when performing a chat completion task with response-level reasoning data.
event: message
data: {"chat_completion":{"id":"chatcmpl-910TWsy2VPnSfBbv5UztnSdYUJA10","choices":[{"delta":{"content":"With two barbers, the paradox disappears.","role":"assistant"},"index":0,"reasoning":"The contradiction only occurs when a barber must determine whether to shave himself.","reasoning_details":[{"type":"reasoning.encrypted","data":"[REDACTED]"},{"type":"reasoning.summary","summary":"Two barbers can shave each other."},{"type":"reasoning.text","text":"The contradiction only occurs when a barber must determine whether to shave himself.","signature":"sig_example"}]}],"model":"openai-gpt-oss-120b","object":"chat.completion.chunk"}}

event: message
data: {"chat_completion":{"id":"chatcmpl-910TWsy2VPnSfBbv5UztnSdYUJA10","choices":[{"delta":{"summary":"Each barber can shave the other, so neither needs to shave himself.","role":"assistant"},"index":0,"reasoning":"With two barbers, they can shave each other.","reasoning_details":[{"type":"reasoning.encrypted","data":"[REDACTED]"},{"type":"reasoning.summary","summary":"avoiding the self-reference paradox"},{"type":"reasoning.text","format":"some_text_reasoning_detail_format","text":"With two barbers, they can shave each other.","signature":"sig_example"}]}],"model":"openai-gpt-oss-120b","object":"chat.completion.chunk"}}

(...)

event: message
data: {"chat_completion":{"id":"chatcmpl-910TWsy2VPnSfBbv5UztnSdYUJA10","choices":[],"model":"openai-gpt-oss-120b","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44,"completion_tokens_details":{"reasoning_tokens":10}}}}

event: message
data: [DONE]
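When reasoning data is present, each decoded chunk can carry `reasoning` (plaintext) and `reasoning_details` (structured) alongside the answer deltas, as in the example above. A sketch of separating the two when consuming such chunks (plain dicts as decoded from the `data:` lines; not part of any client library):

```python
def collect_reasoning(chunks):
    """Split decoded chunks into plaintext reasoning and structured details.

    Returns (reasoning_text, reasoning_details) accumulated across all
    choices of all chunks; the answer content deltas are left untouched.
    """
    reasoning_text, details = [], []
    for chunk in chunks:
        for choice in chunk["chat_completion"].get("choices", []):
            if choice.get("reasoning"):
                reasoning_text.append(choice["reasoning"])
            details.extend(choice.get("reasoning_details", []))
    return "".join(reasoning_text), details
```

Chunks whose `choices` array is empty (such as the final usage-only chunk) pass through harmlessly.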