Deploy the model in your cluster

After you import the model and vocabulary, you can use Kibana to view and manage their deployment across your cluster under Machine Learning > Model Management. Alternatively, you can use the start trained model deployment API. Because Eland uses APIs to deploy models, the models won't appear in Kibana until the saved objects are synchronized. To synchronize them, you can follow the prompts in Kibana, wait for automatic synchronization, or use the sync machine learning saved objects API.
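
As a minimal sketch of the API route, the helpers below only build the method and path for the two requests mentioned above. The function names and the model ID `my_model` are hypothetical; sending the requests, including authentication, is left to whichever HTTP client you use.

```python
from urllib.parse import quote, urlencode


def start_deployment_request(model_id, **params):
    """Build (method, path) for the start trained model deployment API."""
    path = f"/_ml/trained_models/{quote(model_id, safe='')}/deployment/_start"
    if params:
        # Sort the parameters so the generated path is deterministic.
        path += "?" + urlencode(sorted(params.items()))
    return "POST", path


def sync_saved_objects_request():
    """Build (method, path) for the sync machine learning saved objects API."""
    return "GET", "/api/ml/saved_objects/sync"


# "my_model" is a placeholder model ID, not one taken from this guide.
print(start_deployment_request("my_model", wait_for="started"))
print(sync_saved_objects_request())
```

Note that the first request targets Elasticsearch, while the sync request targets Kibana, so the two are sent to different base URLs.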

You can deploy a model multiple times by assigning a unique deployment ID when starting the deployment. You can optimize your deployment for typical use cases, such as search and ingest. When you optimize for ingest, the throughput will be higher, which increases the number of inference requests that can be performed in parallel. When you optimize for search, the latency will be lower during search processes.

Dedicated deployments for search and ingest keep each workload from affecting the other, and avoid performance issues that can be hard to diagnose.
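
The sketch below shows one way two dedicated deployments of the same model might be parameterized. It assumes the usual guidance for the start trained model deployment API, where `number_of_allocations` mainly affects throughput and `threads_per_allocation` mainly affects latency. The deployment IDs and the numeric values are illustrative assumptions, not the values Kibana selects for its presets.

```python
def deployment_params(deployment_id, number_of_allocations, threads_per_allocation):
    """Return query parameters for one deployment of a model."""
    # Assumption: the API expects threads_per_allocation to be a power of 2,
    # at most 32, as documented for the API at the time of writing.
    t = threads_per_allocation
    if t < 1 or t > 32 or t & (t - 1) != 0:
        raise ValueError("threads_per_allocation must be a power of 2, at most 32")
    return {
        "deployment_id": deployment_id,
        "number_of_allocations": number_of_allocations,
        "threads_per_allocation": t,
    }


# Hypothetical IDs and sizes: more threads per allocation for lower search
# latency, more allocations for higher ingest throughput.
search = deployment_params("my_model_search", number_of_allocations=1, threads_per_allocation=4)
ingest = deployment_params("my_model_ingest", number_of_allocations=4, threads_per_allocation=1)

print(search)
print(ingest)
```

Search queries would then reference `my_model_search` and ingest pipelines `my_model_ingest`, so neither workload competes with the other.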

Note

The Optimize for use case options (Ingest, Search, Balanced) apply only to embedding models. Rerank models do not include this selector in the deployment modal.

Each deployment is tuned automatically for the purpose you choose. For Elastic Rerank, you can start or update deployments from the Trained Models page, or deploy directly using the inference API.

Model deployment on the Trained Models UI.

You can define the resource usage level of the NLP model during model deployment. The resource usage levels behave differently depending on whether adaptive resources are enabled or disabled. When adaptive resources are disabled but machine learning autoscaling is enabled, the vCPU usage of Cloud deployments is derived from the Cloud console and functions as follows:

Low: This level limits resources to two vCPUs, which may be suitable for development, testing, and demos depending on your parameters. It is not recommended for production use.
Medium: This level limits resources to 32 vCPUs, which may be suitable for development, testing, and demos depending on your parameters. It is not recommended for production use.
High: This level may use the maximum number of vCPUs available for this deployment from the Cloud console. If the maximum is 2 vCPUs or fewer, this level is equivalent to the medium or low level.
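
The relationship between the three levels can be modeled in a few lines. This is an illustrative reading of the list above, not Elastic's implementation: it assumes that no level can exceed the maximum configured in the Cloud console, which is what makes the high level collapse to the lower ones when that maximum is small. The function and constant names are hypothetical.

```python
# Per-level vCPU caps from the list above; None means "no cap of its own".
LEVEL_CAPS = {"low": 2, "medium": 32, "high": None}


def effective_vcpu_limit(level, cloud_max_vcpus):
    """Return the vCPU limit for a level, bounded by the Cloud console maximum."""
    cap = LEVEL_CAPS[level]
    if cap is None:
        return cloud_max_vcpus
    return min(cap, cloud_max_vcpus)


# With a small Cloud maximum, all three levels end up with the same limit.
for level in ("low", "medium", "high"):
    print(level, effective_vcpu_limit(level, cloud_max_vcpus=2))
```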

For the resource levels when adaptive resources are enabled, refer to Trained model autoscaling.

Request queues and search priority

Each allocation of a model deployment has a dedicated queue to buffer inference requests. The size of this queue is determined by the queue_capacity parameter in the start trained model deployment API. When the queue reaches its maximum capacity, new requests are declined until some of the queued requests are processed, creating available capacity once again. When multiple ingest pipelines reference the same deployment, the queue can fill up, resulting in rejected requests. Consider using dedicated deployments to prevent this situation.
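
The rejection behavior can be illustrated with a toy bounded queue. This is a simplified model for intuition only, not Elasticsearch's internal implementation; the class name and request labels are hypothetical, and only `queue_capacity` corresponds to a real API parameter.

```python
from collections import deque


class AllocationQueue:
    """Toy model of one allocation's bounded inference request queue."""

    def __init__(self, queue_capacity):
        self.queue_capacity = queue_capacity
        self._pending = deque()

    def submit(self, request):
        """Queue a request; return False if the queue is full (rejected)."""
        if len(self._pending) >= self.queue_capacity:
            return False
        self._pending.append(request)
        return True

    def process_one(self):
        """Process the oldest queued request, freeing one slot."""
        return self._pending.popleft() if self._pending else None


queue = AllocationQueue(queue_capacity=2)
# Three pipelines share one deployment: the third request finds the queue full.
results = [queue.submit(f"pipeline-{i}") for i in range(3)]
print(results)

queue.process_one()
# Capacity is available again once a queued request has been processed.
print(queue.submit("pipeline-2"))
```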

Inference requests originating from search, such as the text_expansion query, have a higher priority compared to non-search requests. The inference ingest processor generates normal priority requests. If both a search query and an ingest processor use the same deployment, the search requests with higher priority skip ahead in the queue for processing before the lower priority ingest requests. This prioritization accelerates search responses while potentially slowing down ingest where response time is less critical.
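
The effect of prioritization can be sketched with a small priority queue. As with any toy model, this is an assumption-laden illustration rather than the actual scheduler: the class, the two priority constants, and the request labels are hypothetical.

```python
import heapq
from itertools import count

# Lower numbers are served first, so search outranks ingest.
SEARCH_PRIORITY = 0
INGEST_PRIORITY = 1


class PriorityInferenceQueue:
    """Toy queue where search requests skip ahead of ingest requests."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._arrival), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2]


queue = PriorityInferenceQueue()
queue.submit(INGEST_PRIORITY, "ingest doc-1")
queue.submit(INGEST_PRIORITY, "ingest doc-2")
queue.submit(SEARCH_PRIORITY, "search query-1")

# The search request arrived last but is processed first.
order = [queue.next_request() for _ in range(3)]
print(order)
```

Within the same priority, requests keep their arrival order, so ingest is delayed by search traffic but not reordered.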