Multilingual image search with Jina CLIP v2 and Elasticsearch

Build a multilingual image search system using Jina CLIP v2 and Elasticsearch. Query your image collection in 89 languages with no translation pipeline, and use Matryoshka Representations to cut index size by 75%.

In a previous article, we explored alternatives to OpenAI's Contrastive Language–Image Pre-training (CLIP) for multimodal search, including Jina CLIP v1. In this article, we take it further with Jina CLIP v2, a multilingual, multimodal embedding model that lets you search an image collection in 89 languages using the same Elasticsearch index and the same model. We'll also look at Matryoshka Representations, a v2 feature that lets you reduce your index size by 75%.

Prerequisites

  • Elasticsearch 9.x cluster (start a free trial)
  • Python 3.9+
  • Jina API key (free at jina.ai with 100K free tokens, enough for this demo)

You can follow along with the full notebook for the complete code.

Jina CLIP v1 versus v2

Before writing any code, it's worth understanding what changed. The headline feature is multilingual support, but there are several other meaningful improvements:

Feature                    | Jina CLIP v1 | Jina CLIP v2
Languages                  | English only | 89 languages
Max image resolution       | 224x224      | 512x512
Text encoder               | JinaBERT     | Jina XLM-RoBERTa
Matryoshka Representations | No           | Yes
Embedding dimensions       | 768          | 1024
Max text length            | 512 tokens   | 8192 tokens

The text encoder upgrade from JinaBERT to Jina XLM-RoBERTa is what enables multilingual support. You can now write a query in French and retrieve English-tagged images; the model maps both into the same embedding space.

With v2, queries up to 8,192 tokens are embedded in full; anything beyond that is truncated if the truncate option is enabled.

Setup

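Elasticsearch as a vector database allows us to store and search dense embeddings natively. We use a dense_vector field with 1024 dimensions and cosine similarity. Cosine similarity suits CLIP-style embeddings because it compares the direction of two vectors rather than their magnitude, and Elasticsearch normalizes cosine vectors to unit length at index time so that it can use the faster dot product internally: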
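
A minimal sketch of such a mapping follows. The index name `jina-clip-v2-images` and the field names `image_url`, `tags`, and `embedding` are assumptions made for illustration, and `client` is assumed to be an `elasticsearch.Elasticsearch` instance connected to your cluster.

```python
# Hypothetical index and field names; adjust them to your own schema.
INDEX_NAME = "jina-clip-v2-images"


def build_mappings(dims=1024):
    """Return index mappings with a dense_vector field of the given size."""
    return {
        "properties": {
            "image_url": {"type": "keyword"},
            "tags": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }


def create_index(client, name=INDEX_NAME, dims=1024):
    """Create the index; `client` is assumed to be an Elasticsearch client."""
    return client.indices.create(index=name, mappings=build_mappings(dims))
```

Keeping the mapping in a function that takes `dims` makes it easy to create a second, smaller index later without duplicating the schema.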

Jina Embeddings API

We use the Jina Embeddings API, a REST API that handles both text and image inputs with the same model:
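
One possible helper for calling the API is sketched below using only the standard library. The endpoint URL, the `jina-clip-v2` model name, and the request and response shapes reflect our understanding of the public Jina API and should be checked against the current API reference; the `JINA_API_KEY` environment variable and the function names are our own choices.

```python
import json
import os
import urllib.request

JINA_URL = "https://api.jina.ai/v1/embeddings"


def build_payload(inputs, dimensions=1024):
    """Build a request body; each input is {"text": ...} or {"image": ...}."""
    return {
        "model": "jina-clip-v2",
        "dimensions": dimensions,
        "input": inputs,
    }


def get_embeddings(inputs, dimensions=1024, api_key=None):
    """POST the inputs to the Jina API and return one vector per input."""
    api_key = api_key or os.environ["JINA_API_KEY"]
    request = urllib.request.Request(
        JINA_URL,
        data=json.dumps(build_payload(inputs, dimensions)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Sort by the returned index so vectors line up with the inputs.
    ordered = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]
```

A call such as `get_embeddings([{"text": "a red bicycle"}, {"image": "https://example.com/photo.jpg"}])` would return two vectors, one per input, in the same embedding space.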

The dimensions parameter controls the output size and is key to Matryoshka support, which we'll cover at the end of this article. For now, we use the full 1024 dimensions.

Load the dataset

We use the StockImages-CC0 dataset, which contains around 4,000 CC0-licensed stock photos with descriptive tags. Images are 1200px wide, well above CLIP v2's 512x512 input size, so we resize them during embedding.

We select 20 diverse images covering different categories to keep the demo fast and the results easy to interpret:
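
The selection itself can be done by hand or with a small helper. The sketch below is one way to do it, assuming each record is a dict with a `category` key (a hypothetical field derived from the dataset's tags); it cycles through the categories so that no single one dominates the sample.

```python
from collections import OrderedDict


def select_diverse(records, n=20, key="category"):
    """Pick up to n records, cycling through categories so none dominates."""
    buckets = OrderedDict()
    for record in records:
        buckets.setdefault(record[key], []).append(record)

    selected = []
    # Take one record per category per pass until we have n or run out.
    while len(selected) < n and any(buckets.values()):
        for items in buckets.values():
            if items and len(selected) < n:
                selected.append(items.pop(0))
    return selected
```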

Generate image embeddings

The following diagram illustrates the two-step pipeline: First, images are embedded with CLIP v2 and stored in Elasticsearch; and then, a text or image query is embedded with the same model and used for k-nearest neighbor (kNN) similarity search:

We encode all 20 images in a single API call. CLIP v2 models embed images and text into the same vector space, which is what makes text-to-image search possible:
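
A sketch of the batch step is shown here. It assumes each record carries an `image_url` field (a hypothetical name) and that `embed_fn` is any callable that takes a list of `{"image": ...}` inputs and returns one vector per input in the same order, for example a thin wrapper around the Jina Embeddings API.

```python
def embed_images(records, embed_fn):
    """Embed every image in one batch and attach the vector to its record."""
    inputs = [{"image": record["image_url"]} for record in records]
    vectors = embed_fn(inputs)
    if len(vectors) != len(records):
        raise ValueError("expected one embedding per image")
    # Return new dicts so the original records are left untouched.
    return [
        {**record, "embedding": vector}
        for record, vector in zip(records, vectors)
    ]
```

Passing the embedding function in as an argument keeps this step easy to test with a stub before spending API tokens.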

Index documents

We use the Elasticsearch bulk helper to index all documents in one call:
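
A minimal sketch of that step, assuming the hypothetical index name `jina-clip-v2-images` and documents that already contain their `embedding` field. `helpers.bulk` comes from the official `elasticsearch` Python package, which is imported inside the function so the action generator can be used on its own.

```python
def generate_actions(docs, index_name="jina-clip-v2-images"):
    """Yield one bulk action per document, using its position as the _id."""
    for doc_id, doc in enumerate(docs):
        yield {"_index": index_name, "_id": str(doc_id), "_source": doc}


def index_documents(client, docs, index_name="jina-clip-v2-images"):
    """Send the documents through the bulk helper, then refresh the index."""
    from elasticsearch import helpers  # requires the elasticsearch package

    indexed, errors = helpers.bulk(client, generate_actions(docs, index_name))
    # Refresh so the new documents are searchable right away.
    client.indices.refresh(index=index_name)
    return indexed, errors
```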

We encode a text query with the same Jina CLIP v2 model we used for the images and then run a kNN search against the image embeddings. Because Jina CLIP v2 maps text from all supported languages and images into the same embedding space, queries in different languages retrieve the same images:
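
The search step might look like the sketch below. The index and field names are the same hypothetical ones used for illustration throughout, `client` is assumed to be an `elasticsearch.Elasticsearch` instance, and `query_vector` is the embedding of the query text.

```python
def knn_search(client, query_vector, index_name="jina-clip-v2-images", k=5):
    """Return (score, source) pairs for the k nearest image embeddings."""
    response = client.search(
        index=index_name,
        knn={
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            # num_candidates must be at least k.
            "num_candidates": max(k, 50),
        },
        source=["image_url", "tags"],
    )
    return [(hit["_score"], hit["_source"]) for hit in response["hits"]["hits"]]
```

The language of the query never appears in the search request; only the vector does, which is why one index serves every language.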

We test with three query sets, each translated into English, Spanish, French, and Portuguese:
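
The structure of such a test can be sketched as follows. The query set shown is a hypothetical example, not one of the three used in this article, and `search_fn` is assumed to be any callable that embeds a text query, runs the kNN search, and returns a ranked list of document ids.

```python
# Hypothetical query set: one sentence in four languages.
QUERY_SETS = [
    {
        "en": "a dog playing in the snow",
        "es": "un perro jugando en la nieve",
        "fr": "un chien jouant dans la neige",
        "pt": "um cachorro brincando na neve",
    },
]


def top_ids_by_language(query_set, search_fn, k=3):
    """Run each translation through search_fn and keep the top-k ids."""
    return {lang: search_fn(text)[:k] for lang, text in query_set.items()}


def languages_agree(results):
    """True when every language returned the same ids in the same order."""
    rankings = list(results.values())
    return all(ranking == rankings[0] for ranking in rankings)
```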

As you can see in the images below, all four language variants of each query return the same top results. The ranking scores are nearly identical across languages:

Beyond text queries, you can use an image as the query to find visually similar images. The approach is the same: Encode the query image into the embedding space, and run kNN search:
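
For a local query image, one option is to send the file contents as a base64 string. This sketch assumes the Jina API accepts base64-encoded image data in the `image` field in addition to URLs, which is worth confirming against the API reference; the function name is our own.

```python
import base64


def image_file_to_input(path):
    """Read an image file and wrap it as a base64 image input."""
    with open(path, "rb") as handle:
        encoded = base64.b64encode(handle.read()).decode("ascii")
    return {"image": encoded}
```

The resulting dict can be embedded exactly like a text input, and the returned vector is then used as the `query_vector` of the kNN search.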

Let’s try an image search using the following image of the Eiffel Tower:

Results:

Using the Eiffel Tower as the query, the model returns the image itself, followed by a cathedral and a town with a hot air balloon; both are visually and semantically adjacent to an urban landmark. The vineyard and skatepark are less obvious matches; with only 20 images in the index, kNN always returns k results regardless of relevance.

Matryoshka Representations

Jina CLIP v2 supports Matryoshka Representation Learning (MRL). The idea is that the model is trained so that the first N dimensions of an embedding already capture most of the information, and you can truncate the rest. You get smaller vectors with minimal quality loss.

The Jina API exposes this directly via the dimensions parameter, which accepts any integer between 64 and 1024.

According to Jina's benchmarks, reducing from 1024 to 256 dimensions maintains over 99% of retrieval quality across text, image, and cross-modal tasks.
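
To make the idea concrete, the sketch below truncates a vector locally and works out the raw storage involved. It assumes that keeping the first N values and rescaling to unit length approximates what the API returns for a smaller `dimensions` value; in practice you would simply request the smaller size from the API. The storage figure covers raw float32 values only and ignores index overhead.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw float32 storage for the vectors alone, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value
```

For one million images, `raw_vector_bytes(1_000_000, 1024)` is 4,096,000,000 bytes and `raw_vector_bytes(1_000_000, 256)` is 1,024,000,000 bytes, which is where the 75% reduction comes from.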

To use a reduced dimension, create a separate Elasticsearch index with dims set to your target size. Elasticsearch's dense_vector field is fixed at index creation; you can't query with a 256-dim vector against a 1024-dim index:
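
A sketch of the second index, together with a small guard against the dimension mismatch described above. The index naming scheme and field name are hypothetical, and `client` is assumed to be an `elasticsearch.Elasticsearch` instance.

```python
def reduced_index_name(dims, base="jina-clip-v2-images"):
    """Derive a per-dimension index name, e.g. jina-clip-v2-images-256d."""
    return f"{base}-{dims}d"


def create_reduced_index(client, dims=256):
    """Create a separate index whose dense_vector field has `dims` values."""
    name = reduced_index_name(dims)
    client.indices.create(
        index=name,
        mappings={
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                }
            }
        },
    )
    return name


def check_query_dims(query_vector, dims):
    """Fail early if the query vector does not match the index dimensions."""
    if len(query_vector) != dims:
        raise ValueError(f"expected {dims} values, got {len(query_vector)}")
```

Both the images and the queries for this index need to be embedded with the same reduced `dimensions` value.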

Now compare results between the 1024-dim and 256-dim indices:
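
One simple way to quantify the comparison is to look at how much the two top-k lists overlap and whether the order is preserved. The helper below is an illustrative sketch; it takes the ranked document ids returned by each index for the same query.

```python
def compare_rankings(full_ids, reduced_ids):
    """Summarize how closely two top-k result lists agree."""
    shared = set(full_ids) & set(reduced_ids)
    k = max(len(full_ids), len(reduced_ids), 1)
    return {
        "overlap": len(shared) / k,
        "same_order": full_ids == reduced_ids,
        "only_full": [i for i in full_ids if i not in shared],
        "only_reduced": [i for i in reduced_ids if i not in shared],
    }
```

An overlap of 1.0 with `same_order` set to `True` means the reduced index returned exactly the same ranking as the full one for that query.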

These are the results:

The top results are the same at 256 and 1024 dimensions. In larger-scale deployments, 256-dim embeddings cut raw vector storage to a quarter and should also lower query latency, making Matryoshka a practical optimization for production systems where index size matters. Always measure retrieval quality on your own dataset before committing to a smaller dimension.

The multimodal gap

It's worth noting that CLIP-style dual-encoder models have a known limitation called the multimodal gap: Text and image embeddings form separated clusters in vector space, which can make cross-modal similarity scores less reliable. Jina addressed this in jina-embeddings-v4 by replacing the dual-encoder architecture with a unified model, and a multimodal v5 is in development. If cross-modal alignment is critical for your use case, keep an eye on these newer models.

Conclusion

Jina CLIP v2 extends v1 with multilingual support across 89 languages, larger embeddings, higher image resolution, and Matryoshka embeddings that let you trade index size for a small quality loss. The API is similar, so you can use this model in the same way as the first version.
