Using the Elasticsearch Inference API with Hugging Face models

Learn how to connect Elasticsearch to Hugging Face models using inference endpoints, and build a multilingual blog recommendation system with semantic search and chat completions.


In recent updates, Elasticsearch introduced a native integration to connect to models hosted on the Hugging Face Inference Service. In this post, we’ll explore how to configure this integration and perform inference through simple API calls using a large language model (LLM). We’ll use SmolLM3-3B, a lightweight general-purpose model with a good balance between resource usage and answer quality.

Prerequisites

To follow along, you'll need an Elasticsearch deployment with access to the inference API, a Hugging Face account, and a Python environment for running the accompanying notebook.

Chat completions using a Hugging Face inference endpoint

First, we’ll build a practical example that connects Elasticsearch to a Hugging Face inference endpoint to generate AI-powered recommendations from a collection of blog posts. For the app’s knowledge base, we’ll use a dataset of company blog articles, which contains valuable but often hard-to-navigate information.

With this endpoint, semantic search retrieves the most relevant articles for a given query, and a Hugging Face LLM generates short, contextual recommendations based on those results.

Let’s take a look at a high-level overview of the information flow we’re going to build:

In this article, we’ll test SmolLM3-3B’s ability to combine a compact size with strong multilingual reasoning and tool-calling capabilities. Based on a search query, we’ll send all the matching content (in English and Spanish) to the LLM to generate a list of recommended articles, each with a custom description based on the search query and results.

Here’s what the UI of an article site with an AI recommendations generation system could look like.

You can find the full implementation of this application in the linked notebook.

Configuring Elasticsearch inference endpoints

To use the Elasticsearch Hugging Face inference endpoint, we need two important elements: a Hugging Face API key and a running Hugging Face endpoint URL. It should look like this:

The Hugging Face inference endpoint in Elasticsearch supports several task types: text_embedding, completion, chat_completion, and rerank. In this blog post, we use chat_completion because we need the model to generate conversational recommendations based on the search results and a system prompt. This endpoint allows us to perform chat completions directly from Elasticsearch with a simple call to the Elasticsearch API:
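As a minimal sketch (the inference endpoint ID `hugging-face-chat` and the prompts are our own, not fixed names), a chat completion request carries an OpenAI-style message list and is sent to the streaming `_inference` route:

```python
import json

# The inference endpoint ID is an assumption; use whichever ID you create later.
INFERENCE_ID = "hugging-face-chat"

# OpenAI-style message list, as expected by the chat_completion task type.
chat_request = {
    "messages": [
        {"role": "system", "content": "You are a blog recommendation assistant."},
        {"role": "user", "content": "Recommend articles about security."},
    ]
}

# The request is streamed from:
#   POST /_inference/chat_completion/{INFERENCE_ID}/_stream
print(json.dumps(chat_request, indent=2))
```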

This will serve as the core of the application, receiving the prompt and the search results that will pass through the model. With the theory covered, let’s start implementing the application.

Setting up an inference endpoint on Hugging Face

To deploy the Hugging Face model, we’re going to use Hugging Face one-click deployments, an easy and fast service for deploying model endpoints. Keep in mind that this is a paid service, and using it may incur additional costs. This step creates the model instance that will generate the article recommendations.

You can pick a model from the one-click catalog:

Let’s pick the SmolLM3-3B model:

From here, grab the Hugging Face endpoint URL:

As mentioned in the Elasticsearch Hugging Face inference endpoints documentation, text generation requires a model that’s compatible with the OpenAI API. For that reason, we need to append the /v1/chat/completions subpath to the Hugging Face endpoint URL. The final result will look like this:
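For illustration only (the host below is a placeholder; Hugging Face assigns the real hostname when you deploy), the final URL has this shape:

```python
# Placeholder host; Hugging Face assigns the real one when you deploy the endpoint.
HF_ENDPOINT_URL = (
    "https://my-smollm3-endpoint.endpoints.huggingface.cloud"
    "/v1/chat/completions"
)
```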

With this in place, we can start coding in a Python notebook.

Generating a Hugging Face API key

Create a Hugging Face account, and obtain an API token by following these instructions. You can choose from three token types: fine-grained (recommended for production, as it grants access only to specific resources), read (read-only access), or write (read and write access). For this tutorial, a read token is sufficient, since we only need to call the inference endpoint. Save this key for the next step.

Setting up the Elasticsearch inference endpoint

First, let’s declare an Elasticsearch Python client:
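A minimal sketch, assuming your deployment URL and API key are stored in environment variables (the variable names are our own):

```python
import os

from elasticsearch import Elasticsearch  # pip install elasticsearch

# Environment variable names are assumptions; adapt them to your setup.
es = Elasticsearch(
    hosts=os.environ["ELASTICSEARCH_URL"],
    api_key=os.environ["ELASTICSEARCH_API_KEY"],
)
```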

Next, let’s create an Elasticsearch inference endpoint that uses the Hugging Face model. This endpoint will allow us to generate responses based on the blog posts and the prompt passed to the model.
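A sketch of the endpoint creation using the `hugging_face` service. The inference ID and environment variable names are our own, and the parameter names follow recent 8.x Python clients, so they may differ slightly in your version:

```python
import os

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch(
    hosts=os.environ["ELASTICSEARCH_URL"],
    api_key=os.environ["ELASTICSEARCH_API_KEY"],
)

# "hugging-face-chat" and the env var names are assumptions.
es.inference.put(
    task_type="chat_completion",
    inference_id="hugging-face-chat",
    inference_config={
        "service": "hugging_face",
        "service_settings": {
            "api_key": os.environ["HF_API_KEY"],
            # Must end with /v1/chat/completions, as noted earlier.
            "url": os.environ["HF_ENDPOINT_URL"],
        },
    },
)
```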

Dataset

The dataset contains the blog posts that will be queried, representing a multilingual content set used throughout the workflow:
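A small illustrative slice of such a dataset (the titles and text here are invented for the example; the real dataset ships with the notebook):

```python
# Invented sample documents; the real dataset accompanies the linked notebook.
dataset = [
    {
        "title": "Hardening your cluster: security best practices",
        "language": "en",
        "content": "A walkthrough of TLS, API keys, and role-based access control...",
    },
    {
        "title": "Detección de vulnerabilidades con Elastic Security",
        "language": "es",
        "content": "Cómo identificar y priorizar vulnerabilidades en tu infraestructura...",
    },
]
```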

Elasticsearch mappings

With the dataset defined, we need to create a data schema that properly fits the blog post structure. The following index mappings will be used to store the data in Elasticsearch:
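A sketch of mappings consistent with the structure described in this section: text fields copied into a `semantic_text` field, and a `title` object with `original` and `translated_title` subfields. The index name in the commented call is our own:

```python
# Field layout mirrors the structure described in this section;
# exact field names beyond those mentioned in the post are assumptions.
mappings = {
    "properties": {
        "title": {
            "properties": {
                "original": {"type": "text", "copy_to": "semantic_text"},
                "translated_title": {"type": "text", "copy_to": "semantic_text"},
            }
        },
        "content": {"type": "text", "copy_to": "semantic_text"},
        "language": {"type": "keyword"},
        # semantic_text handles chunking and embedding automatically.
        "semantic_text": {"type": "semantic_text"},
    }
}

# With the client declared earlier (index name "blog-posts" is our own):
# es.indices.create(index="blog-posts", mappings=mappings)
```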

Here, we can see more clearly how the data is structured. We’ll use semantic search to retrieve results based on natural language, along with the copy_to property to copy the field contents into the semantic_text field. Additionally, the title field contains two subfields: the original subfield stores the title in either English or Spanish, depending on the original language of the article; and the translated_title subfield is present only for Spanish articles and contains the English translation of the original title.

Ingesting data

The following code snippet ingests the blog posts dataset into Elasticsearch using the bulk API:
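A sketch of that step: a pure helper builds the bulk actions, and the `bulk` call (commented, since it needs a live cluster) assumes the client declared earlier and an index name of our own choosing:

```python
def to_bulk_actions(docs, index_name="blog-posts"):
    """Wrap each document in the action format expected by helpers.bulk."""
    return [{"_index": index_name, "_source": doc} for doc in docs]

# With the client declared earlier:
# from elasticsearch.helpers import bulk
# bulk(es, to_bulk_actions(dataset))
```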

Now that we have the articles ingested into Elasticsearch, we need to create a function capable of searching against the semantic_text field:
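A minimal sketch of that function, assuming a `semantic` query against the `semantic_text` field and an index name of our own (`blog-posts`):

```python
def build_semantic_query(question, size=5):
    """Build a semantic query against the semantic_text field."""
    return {
        "size": size,
        "query": {"semantic": {"field": "semantic_text", "query": question}},
    }

def semantic_search(es, question, index="blog-posts", size=5):
    """Run the semantic query and return only the stored documents."""
    response = es.search(index=index, **build_semantic_query(question, size))
    return [hit["_source"] for hit in response["hits"]["hits"]]
```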

We also need a function that calls the inference endpoint. In this case, we’ll call the endpoint using the chat_completion task type to get streaming responses:
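A sketch of that call. The message builder is a pure function (the hit field names are assumptions), and the streaming request goes through the generic `perform_request` transport so it works regardless of client version:

```python
def build_chat_messages(system_prompt, question, hits):
    """Assemble OpenAI-style messages from search hits (field names assumed)."""
    context = "\n\n".join(
        f"- {hit.get('title', '')}: {hit.get('content', '')}" for hit in hits
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Query: {question}\n\nArticles:\n{context}"},
    ]

def chat_completion_stream(es, inference_id, messages):
    """Stream server-sent events from the chat_completion task type."""
    return es.perform_request(
        "POST",
        f"/_inference/chat_completion/{inference_id}/_stream",
        headers={"content-type": "application/json",
                 "accept": "text/event-stream"},
        body={"messages": messages},
    )
```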

Now we can write a function that combines the semantic search function with the chat_completion inference endpoint to generate the data that will populate the recommendation cards:

Finally, we need to extract the information and format it to be printed:
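A minimal sketch of a console "card" renderer, assuming each recommendation carries `title` and `description` fields (our own names):

```python
def format_cards(recommendations):
    """Render each recommendation as a simple console 'card'."""
    cards = []
    for rec in recommendations:
        title = rec.get("title", "Untitled")
        description = rec.get("description", "")
        width = max(len(title), 40)
        cards.append("\n".join([
            "+" + "-" * (width + 2) + "+",
            f"| {title.ljust(width)} |",
            "+" + "-" * (width + 2) + "+",
            description,
        ]))
    return "\n\n".join(cards)
```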

Let’s test this by asking a question about the security blog posts:

Here we can see the cards in the console generated by the workflow:

You can see the full results, including all hits and the LLM response, in this file.

We’re asking for articles related to “Security and vulnerabilities.” This question is used as the search query against the documents stored in Elasticsearch. The retrieved results are then passed to the model, which generates recommendations based on their content. As we can see, the model did a great job generating engaging short texts that can motivate the reader to click through.

Conclusion

This example shows how Elasticsearch and Hugging Face can be combined to create a fast and efficient centralized system for AI applications. This approach reduces manual effort and provides flexibility, thanks to Hugging Face’s extensive model catalog. Using SmolLM3-3B, in particular, shows how compact, multilingual models can still deliver meaningful reasoning and content generation when paired with semantic search. Together, these tools offer a scalable and effective foundation for building intelligent content analysis and multilingual applications.
