Create a Google Vertex AI inference endpoint
Generally available
Path parameters
- `task_type`: The type of the inference task that the model will perform. Values are `rerank`, `text_embedding`, `completion`, or `chat_completion`.
- `googlevertexai_inference_id`: The unique identifier of the inference endpoint.
Query parameters
- `timeout`: Specifies the amount of time to wait for the inference endpoint to be created.
External documentation
Body
Required
- `chunking_settings`: The chunking configuration object. Applies only to the `text_embedding` task type. Not applicable to the `rerank`, `completion`, or `chat_completion` task types. External documentation
- `service`: The type of service supported for the specified task type. In this case, `googlevertexai`. Value is `googlevertexai`.
- `service_settings`: Settings used to install the inference model. These settings are specific to the `googlevertexai` service.
- `task_settings`: Settings to configure the inference task. These settings are specific to the task type you specified.
PUT
/_inference/{task_type}/{googlevertexai_inference_id}
Console
PUT _inference/text_embedding/google_vertex_ai_embeddings
{
  "service": "googlevertexai",
  "service_settings": {
    "service_account_json": "service-account-json",
    "model_id": "model-id",
    "location": "location",
    "project_id": "project-id"
  }
}
Python
resp = client.inference.put(
    task_type="text_embedding",
    inference_id="google_vertex_ai_embeddings",
    inference_config={
        "service": "googlevertexai",
        "service_settings": {
            "service_account_json": "service-account-json",
            "model_id": "model-id",
            "location": "location",
            "project_id": "project-id"
        }
    },
)
JavaScript
const response = await client.inference.put({
  task_type: "text_embedding",
  inference_id: "google_vertex_ai_embeddings",
  inference_config: {
    service: "googlevertexai",
    service_settings: {
      service_account_json: "service-account-json",
      model_id: "model-id",
      location: "location",
      project_id: "project-id",
    },
  },
});
Ruby
response = client.inference.put(
  task_type: "text_embedding",
  inference_id: "google_vertex_ai_embeddings",
  body: {
    "service": "googlevertexai",
    "service_settings": {
      "service_account_json": "service-account-json",
      "model_id": "model-id",
      "location": "location",
      "project_id": "project-id"
    }
  }
)
PHP
$resp = $client->inference()->put([
    "task_type" => "text_embedding",
    "inference_id" => "google_vertex_ai_embeddings",
    "body" => [
        "service" => "googlevertexai",
        "service_settings" => [
            "service_account_json" => "service-account-json",
            "model_id" => "model-id",
            "location" => "location",
            "project_id" => "project-id",
        ],
    ],
]);
curl
curl -X PUT -H "Authorization: ApiKey $ELASTIC_API_KEY" -H "Content-Type: application/json" -d '{"service":"googlevertexai","service_settings":{"service_account_json":"service-account-json","model_id":"model-id","location":"location","project_id":"project-id"}}' "$ELASTICSEARCH_URL/_inference/text_embedding/google_vertex_ai_embeddings"
Java
client.inference().put(p -> p
    .inferenceId("google_vertex_ai_embeddings")
    .taskType(TaskType.TextEmbedding)
    .inferenceConfig(i -> i
        .service("googlevertexai")
        .serviceSettings(JsonData.fromJson("{\"service_account_json\":\"service-account-json\",\"model_id\":\"model-id\",\"location\":\"location\",\"project_id\":\"project-id\"}"))
    )
);
Request examples
A text embedding task
Run `PUT _inference/text_embedding/google_vertex_ai_embeddings` to create an inference endpoint to perform a `text_embedding` task type.
{
  "service": "googlevertexai",
  "service_settings": {
    "service_account_json": "service-account-json",
    "model_id": "model-id",
    "location": "location",
    "project_id": "project-id"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_meta_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Meta's model hosted on a shared Google Model Garden endpoint, with a single streaming URL provided. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "meta",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
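The `%LOCATION_ID%`-style tokens in these URLs are placeholders to replace with your own deployment values before creating the endpoint. As an illustrative sketch (the placeholder names come from the examples on this page; the sample values below are hypothetical), a small helper can substitute them and fail fast on anything left unfilled:

```python
import re

def fill_placeholders(url_template: str, values: dict) -> str:
    """Replace %NAME% tokens in a Vertex AI URL template with concrete values.

    Raises KeyError if the template references a placeholder that was not
    supplied, so a misconfigured URL fails before it ever reaches the API.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing value for placeholder %{name}%")
        return values[name]

    return re.sub(r"%([A-Z_]+)%", substitute, url_template)

# Hypothetical sample values, for illustration only.
url = fill_placeholders(
    "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%"
    "/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions",
    {"LOCATION_ID": "us-central1", "PROJECT_ID": "my-project", "ENDPOINT_ID": "1234567890"},
)
```

Note that `%LOCATION_ID%` can appear more than once in a single template, so a simple `str.replace` per key also works; the regex version additionally catches typos in placeholder names.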
Run `PUT _inference/completion/google_model_garden_hugging_face_completion` to create an inference endpoint to perform a `completion` task using Hugging Face's model hosted on a dedicated Google Model Garden endpoint, with a single URL provided for both streaming and non-streaming tasks. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "hugging_face",
    "service_account_json": "service-account-json",
    "url": "https://%ENDPOINT_ID%.%LOCATION_ID%-%PROJECT_ID%.prediction.vertexai.goog/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_hugging_face_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Hugging Face's model hosted on a dedicated Google Model Garden endpoint, with a single streaming URL provided. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "hugging_face",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%ENDPOINT_ID%.%LOCATION_ID%-%PROJECT_ID%.prediction.vertexai.goog/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/completion/google_model_garden_hugging_face_completion` to create an inference endpoint to perform a `completion` task using Hugging Face's model hosted on a shared Google Model Garden endpoint, with a single URL provided for both streaming and non-streaming tasks. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "hugging_face",
    "service_account_json": "service-account-json",
    "url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_hugging_face_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Hugging Face's model hosted on a shared Google Model Garden endpoint, with a single streaming URL provided. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "hugging_face",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/completion/google_model_garden_mistral_completion` to create an inference endpoint to perform a `completion` task using Mistral's serverless model hosted on Google Model Garden, with separate URLs for streaming and non-streaming tasks. See the Mistral model documentation for instructions on how to construct the URLs.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "mistral",
    "model_id": "mistral-small-2503",
    "service_account_json": "service-account-json",
    "url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/mistralai/models/%MODEL_ID%:rawPredict",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/mistralai/models/%MODEL_ID%:streamRawPredict"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_mistral_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Mistral's serverless model hosted on Google Model Garden, with a single streaming URL provided. See the Mistral model documentation for instructions on how to construct the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "mistral",
    "model_id": "mistral-small-2503",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/mistralai/models/%MODEL_ID%:streamRawPredict"
  }
}
Run `PUT _inference/completion/google_model_garden_mistral_completion` to create an inference endpoint to perform a `completion` task using Mistral's model hosted on a dedicated Google Model Garden endpoint, with a single URL provided for both streaming and non-streaming tasks. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "mistral",
    "service_account_json": "service-account-json",
    "url": "https://%ENDPOINT_ID%.%LOCATION_ID%-%PROJECT_ID%.prediction.vertexai.goog/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_mistral_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Mistral's model hosted on a dedicated Google Model Garden endpoint, with a single streaming URL provided. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "mistral",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%ENDPOINT_ID%.%LOCATION_ID%-%PROJECT_ID%.prediction.vertexai.goog/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/completion/google_model_garden_mistral_completion` to create an inference endpoint to perform a `completion` task using Mistral's model hosted on a shared Google Model Garden endpoint, with a single URL provided for both streaming and non-streaming tasks. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "mistral",
    "service_account_json": "service-account-json",
    "url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/rerank/google_vertex_ai_rerank` to create an inference endpoint to perform a `rerank` task type.
{
  "service": "googlevertexai",
  "service_settings": {
    "service_account_json": "service-account-json",
    "project_id": "project-id"
  }
}
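Note that the `rerank` example above supplies only `service_account_json` and `project_id`, while the `text_embedding` example also includes `model_id` and `location`. A minimal client-side sanity check could catch an omission before sending the request; the required-field sets below are inferred from the examples on this page, and the Elasticsearch server remains the authoritative validator:

```python
# Required service_settings fields per task type, inferred from the request
# examples on this page; the server performs the authoritative validation.
REQUIRED_FIELDS = {
    "text_embedding": {"service_account_json", "model_id", "location", "project_id"},
    "rerank": {"service_account_json", "project_id"},
}

def missing_fields(task_type: str, service_settings: dict) -> set:
    """Return the required fields absent from service_settings for task_type."""
    return REQUIRED_FIELDS.get(task_type, set()) - service_settings.keys()

# A rerank configuration needs only the service account JSON and project ID.
assert missing_fields("rerank", {"service_account_json": "...", "project_id": "..."}) == set()
```

Running such a check before calling `PUT _inference/...` turns a server-side 400 into an immediate local error message.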
Run `PUT _inference/chat_completion/google_model_garden_mistral_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Mistral's model hosted on a shared Google Model Garden endpoint, with a single streaming URL provided. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "mistral",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/completion/google_model_garden_ai21_completion` to create an inference endpoint to perform a `completion` task using AI21's model hosted on a serverless Google Model Garden endpoint, with separate URLs for streaming and non-streaming tasks. See the AI21 model documentation for instructions on how to construct the URLs.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "ai21",
    "service_account_json": "service-account-json",
    "url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/ai21/models/%MODEL_ID%:rawPredict",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/ai21/models/%MODEL_ID%:streamRawPredict"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_ai21_chat_completion` to create an inference endpoint to perform a `chat_completion` task using AI21's model hosted on a serverless Google Model Garden endpoint, with a single streaming URL provided. See the AI21 model documentation for instructions on how to construct the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "ai21",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/ai21/models/%MODEL_ID%:streamRawPredict"
  }
}
Run `PUT _inference/completion/google_model_garden_anthropic_completion` to create an inference endpoint to perform a `completion` task using Anthropic's serverless model hosted on Google Model Garden, with separate URLs for streaming and non-streaming tasks. See the Anthropic model documentation for instructions on how to construct the URLs.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "anthropic",
    "service_account_json": "service-account-json",
    "url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/anthropic/models/%MODEL_ID%:rawPredict",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/anthropic/models/%MODEL_ID%:streamRawPredict"
  },
  "task_settings": {
    "max_tokens": 128
  }
}
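The optional `task_settings` object sits alongside `service_settings` at the top level of the request body, as in the Anthropic example above. A sketch of assembling such a body in Python (the field values are placeholders, the dict mirrors the JSON examples on this page, and whether a given provider honours a setting such as `max_tokens` is determined by the service):

```python
def build_completion_body(provider, service_account_json, url, streaming_url,
                          task_settings=None):
    """Assemble a googlevertexai completion request body.

    Mirrors the JSON request examples on this page; task_settings is
    included only when provided, since it is optional.
    """
    body = {
        "service": "googlevertexai",
        "service_settings": {
            "provider": provider,
            "service_account_json": service_account_json,
            "url": url,
            "streaming_url": streaming_url,
        },
    }
    if task_settings:
        body["task_settings"] = task_settings
    return body

# Placeholder URLs, for illustration only.
body = build_completion_body(
    provider="anthropic",
    service_account_json="service-account-json",
    url="https://example.invalid/rawPredict",
    streaming_url="https://example.invalid/streamRawPredict",
    task_settings={"max_tokens": 128},
)
```

The resulting dict can then be passed as the request body of `PUT _inference/completion/{inference_id}`.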
Run `PUT _inference/chat_completion/google_model_garden_anthropic_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Anthropic's serverless model hosted on Google Model Garden, with a single streaming URL provided. See the Anthropic model documentation for instructions on how to construct the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "anthropic",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/publishers/anthropic/models/%MODEL_ID%:streamRawPredict"
  },
  "task_settings": {
    "max_tokens": 128
  }
}
Run `PUT _inference/completion/google_model_garden_meta_completion` to create an inference endpoint to perform a `completion` task using Meta's serverless model hosted on Google Model Garden, with a single URL provided for both streaming and non-streaming tasks. See the Meta model documentation for instructions on how to construct the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "meta",
    "model_id": "meta/llama-3.3-70b-instruct-maas",
    "service_account_json": "service-account-json",
    "url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/openapi/chat/completions"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_meta_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Meta's serverless model hosted on Google Model Garden, with a single streaming URL provided. See the Meta model documentation for instructions on how to construct the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "meta",
    "model_id": "meta/llama-3.3-70b-instruct-maas",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/openapi/chat/completions"
  }
}
Run `PUT _inference/completion/google_model_garden_meta_completion` to create an inference endpoint to perform a `completion` task using Meta's model hosted on a dedicated Google Model Garden endpoint, with a single URL provided for both streaming and non-streaming tasks. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "meta",
    "service_account_json": "service-account-json",
    "url": "https://%ENDPOINT_ID%.%LOCATION_ID%-fasttryout.prediction.vertexai.goog/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/chat_completion/google_model_garden_meta_chat_completion` to create an inference endpoint to perform a `chat_completion` task using Meta's model hosted on a dedicated Google Model Garden endpoint, with a single streaming URL provided. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "meta",
    "service_account_json": "service-account-json",
    "streaming_url": "https://%ENDPOINT_ID%.%LOCATION_ID%-fasttryout.prediction.vertexai.goog/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}
Run `PUT _inference/completion/google_model_garden_meta_completion` to create an inference endpoint to perform a `completion` task using Meta's model hosted on a shared Google Model Garden endpoint, with a single URL provided for both streaming and non-streaming tasks. See the endpoint's `Sample request` page for the variable values used in the URL.
{
  "service": "googlevertexai",
  "service_settings": {
    "provider": "meta",
    "service_account_json": "service-account-json",
    "url": "https://%LOCATION_ID%-aiplatform.googleapis.com/v1/projects/%PROJECT_ID%/locations/%LOCATION_ID%/endpoints/%ENDPOINT_ID%/chat/completions"
  }
}