Jina-VLM: multilingual VQA benchmarks and ICLR 2026


Jina-VLM is a 2.4B-parameter vision-language model that currently leads open 2B-scale models on multilingual VQA benchmarks (MMMB and Multilingual MMBench) across 29 languages. It pairs a SigLIP2 vision encoder with a Qwen3 language decoder and handles arbitrary-resolution inputs without sacrificing token efficiency. Jina by Elastic engineers presented the model at the DATA-FM workshop at ICLR 2026 in Rio. This post covers the architecture, the training approach and what five days at the conference told us about where retrieval, embeddings and reasoning are headed.

Andreas Koukounas (left) and Georgios Mastrapas (right) presenting Jina-VLM at the poster session of the DATA-FM workshop.

jina-vlm is a 2.4B-parameter vision-language model that pairs a SigLIP2 vision encoder with a Qwen3 language decoder, using attention pooling over image tiles to handle arbitrary-resolution inputs with few tokens. Beyond the model itself, the paper's main contribution is its “leave-one-out” data-mixture ablation: By removing one task, domain, modality, or language category at a time during training, you can determine which slices of data are significant or redundant, and whether learning in one domain transfers to others. The result is a compact model that, despite its size, achieves state-of-the-art multilingual VQA performance among open models of its scale.
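To illustrate the leave-one-out idea, here is a minimal sketch that enumerates the ablated mixtures: one per category, with the remaining weights renormalized. The category names and weights are hypothetical, and renormalizing the remaining weights is an assumption of this sketch, not a detail taken from the paper.

```python
def leave_one_out_mixtures(mixture):
    """Yield (held_out, ablated_mixture) for each category in `mixture`.

    Each ablated mixture drops one category and rescales the remaining
    weights so that they sum to 1 (an assumption of this sketch).
    """
    for held_out in mixture:
        remaining = {k: w for k, w in mixture.items() if k != held_out}
        total = sum(remaining.values())
        yield held_out, {k: w / total for k, w in remaining.items()}


# Hypothetical category weights, not the paper's actual data mixture.
full_mixture = {"ocr": 0.2, "charts": 0.3, "multilingual_vqa": 0.5}

ablations = dict(leave_one_out_mixtures(full_mixture))
for held_out, mix in ablations.items():
    print(f"without {held_out}: {mix}")
```

Training one model per ablated mixture and comparing its benchmark scores against the full-mixture model then shows how much each category contributes.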

Rio delivered everything you'd hope for: warm, sunny beach weather, the easy walk between Copacabana and Ipanema, the view from Christ the Redeemer, the colors of Escadaria Selarón. A welcome contrast to a still-chilly European spring.

Georgios’ photos of Rio de Janeiro from the trip.

Conferences like ICLR give everyone a chance to take the field’s pulse and find out what’s hot, what’s not, and what’s coming up. After a few days of walking the aisles at the poster sessions and dropping in on oral sessions, you start to get a sense for things. You start to see the same words on poster after poster, and you notice which sessions are the most crowded.

Here are a few things we picked up on:

Reinforcement Learning with Verifiable Rewards (RLVR) is now the dominant paradigm for post-training refinement. Almost every reasoning-focused poster we stopped at used some form of Group Relative Policy Optimization (GRPO), with rewards computed from math-correctness, code-execution, or formal-logic checks, rather than Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) fine-tuning, which felt like the default a year ago, was conspicuously rare. It makes sense: If you can use code to check for correctness, you no longer need human-annotated preference data, and the training loop runs much faster.

Test-time compute has stopped being a curiosity and become a design problem. Test-time compute, the computation a system spends while generating a response, is an increasingly important study variable. Papers now measure it as part of their experimental setup, and developers try to optimize for it. Models are now built with the expectation that inference will be expensive and clever, not just a single forward pass through a neural network.

Vision-Language Models (VLMs) are everywhere, and Vision-Language-Action models (VLAs) are not far behind. A big chunk of the conference was about how to make multimodal AI work better, like better tokenization for images, better positional encodings for non-text media, and more efficient ways to compress visual information before it overwhelms your model. Vision-Language-Action models that extend multimodal AI recipes to robotics and embodied agents are no longer niche research. They brought in the crowds at their presentations and hosted vibrant debates.

Reports of the death of State-Space Models (SSMs) have been greatly exaggerated. Attention models still dominate AI, but Mamba, SSM variants and recurrent neural networks still draw attention and research, both as full replacements for Transformers and as components inside hybrid attention-based stacks. Whether they'll ever genuinely displace Transformers is an open question, but the line of research is alive and well.

Agentic AI safety is taken very seriously. A lot of papers and presentations discussed problems like machine unlearning and jailbreaking, and some of the most interesting work was on prompt injection through agentic tool use, like when a model dutifully follows instructions hidden in a webpage or an API response it just fetched. A repeated, slightly unsettling observation: models that follow instructions better tend to be more vulnerable to this kind of attack, not less. This capability-vulnerability tension is going to define a lot of the next few years of safety research.

Hallucination and factuality are increasingly framed as retrieval problems. Several talks made that point explicitly: A generative model that has to invent facts will inevitably hallucinate them, while a model that retrieves information can ground its responses in verifiable ways. That framing is, of course, exactly the bet that search AI engineers have been making all along.
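To make the first observation above concrete, here is a minimal sketch of the group-relative scoring step that gives GRPO its name, paired with a toy verifiable reward. The function names, the exact-match check, and the use of the population standard deviation are illustrative assumptions. Real implementations differ in their normalization details and embed this step in a full policy-gradient update.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, expected: str) -> float:
    """Return 1.0 when a programmatic check passes, else 0.0 (toy exact match)."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sample against the mean and spread of its own group.

    No learned value model is involved: the group of samples drawn for
    the same prompt serves as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same prompt, whose checked answer is "42".
samples = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

for sample, adv in zip(samples, advantages):
    print(f"{sample!r}: advantage {adv:+.3f}")
```

Correct answers receive positive advantages and incorrect ones negative, with no human annotation in the loop.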

ICLR 2026 invited talks: hidden universe imaging and open AI development

Two of the invited talks stood out to us, albeit for very different reasons:

Images of the Hidden Universe

Katie Bouman presented a tour of how physics, prior knowledge, and machine learning combine to reconstruct information that the universe never gives us directly, like the silhouettes of supermassive black holes and invisible dark matter structures. She walked us through the Event Horizon Telescope's imaging of M87 and Sagittarius A*, building images up from indirect and incomplete radio measurements, and then extended the same machinery to mapping dark matter through gravitational lensing.

This talk was a useful reminder of why machine learning matters outside the LLM bubble. The more you already know, the more you can learn from a little bit more information. This principle generalizes beyond astronomy to knowledge in general, and to machine learning in particular. Any decision system that uses sparse, noisy observations is confronted with it.

_____________________________________________________________________________________

Marin: Open Development of Frontier AI

Percy Liang opened his presentation with a blunt observation: As AI capabilities skyrocket, openness plummets. His response is Marin, a platform for community-driven AI research where every experiment is open, every suggestion or discussion is on public fora, and anyone can review or rerun a result.

What makes Marin interesting isn't just that it creates open-weight models (plenty of projects do that) but that it creates an open process for making models. Project pre-registration, peer review, and reproducibility have long been part of the natural sciences, and Marin attempts to bring that tradition to AI. Model training is treated as a matter of public scientific record.

The talk presented concrete scientific results from this approach (optimizer findings and scaling-law results), suggesting that community-scale science isn't just an aspiration but a workable methodology.

_____________________________________________________________________________________

Bouman and Liang made a pleasingly complementary pair: one a reminder of how much ML has to offer the world outside ML, the other a challenge to how the field organizes itself.

ICLR 2026 research highlights: embedding models, retrievers and sparse representations

We attended many oral presentations and poster sessions. The papers below stood out because of their potential to impact how we make and use embedding models.

Rethinking pretraining for representations

Decoder-only models have dominated the LLM leaderboards for years, but one paper makes a case for encoder models.

Seq vs Seq: An Open Suite of Paired Encoders and Decoders presents a reproducible, open-data, architecture-controlled comparison of encoder-only and decoder-only models. The authors used the same data, the same architecture, and the same training recipe, varying only the training objective: Bidirectional Masked Language Modeling (MLM), typically associated with encoders, vs. Causal Language Modeling (CLM), usually used in decoders. Their results confirm prior findings that encoders excel at classification and retrieval while decoders excel at generation. A key finding is that cross-objective continued pretraining (training an encoder further with CLM, or a decoder further with MLM) does not close the performance gap between encoders and decoders. A 400M-parameter encoder beats a 1B-parameter decoder in classification and retrieval, and vice versa for generative tasks. All artifacts, including data, checkpoints, and code, are open-sourced.

Their study delivers a definitive empirical finding for the AI community: Encoder-only pretraining is substantially more efficient for classification and retrieval tasks than adapting decoders to act like encoders, even with post-training on high-quality data. This challenges the recent trend of adapting large decoder LLMs for embedding tasks, as LLM2Vec does. Dedicated encoder pretraining from scratch remains the most reliable path to strong retrieval performance. Additionally, the public release of 200+ checkpoints with batch-ordered training data makes their work an invaluable resource for studying how retrieval-relevant representations emerge during training and how they scale with parameter count and tokens.
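For readers who want the difference between the two objectives spelled out, here is a minimal sketch of what each one lets a token see and what it is asked to predict. The helper names and the `[MASK]` placeholder are illustrative assumptions, not code from the paper.

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def clm_targets(tokens):
    """Causal LM: every position predicts the token that follows it."""
    return list(zip(tokens[:-1], tokens[1:]))


def mlm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Masked LM: hide the chosen positions and predict only those."""
    inputs = [mask_token if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets


tokens = ["the", "cat", "sat", "down"]

causal = attention_mask(len(tokens), causal=True)          # decoder-style
bidirectional = attention_mask(len(tokens), causal=False)  # encoder-style

pairs = clm_targets(tokens)
inputs, targets = mlm_example(tokens, masked_positions={1})
```

Under the causal mask the first token sees only itself, while under the bidirectional mask every token sees the whole sequence. That full-context view is what makes MLM-trained encoders a natural fit for representation tasks.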

New paradigms for training retrievers and embedders

Revela: Dense Retriever Learning via Language Modeling reframes dense retriever training as a language modeling problem. Rather than using supervised training with query-document pairs, it trains a retriever model jointly with a language model by conditioning next-token prediction on all the other documents in the batch. This in-batch attention mechanism modifies the model’s Transformer blocks by injecting the similarity scores of documents in each batch into the cross-document attention weights. Training is done on raw text, without query-document pairs, hard negatives, or synthetic data generation. The resulting 3B-parameter model outperforms E5-Mistral-7B-Instruct as well as proprietary closed-weight embedding models from OpenAI, Cohere, and Voyage. On retrieval benchmarks, it matches E5 despite using roughly 1000 times less training data and approximately 10 times less compute.

This demonstrates that next-token prediction can serve as an effective training objective for high-quality dense retrieval. This is important because plain text data, which is all that next-token prediction requires, is widespread and inexpensive, and this paper shows that it’s all you need to train competitive embedding models.

Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement advances the proposition that LLMs should learn to "speak an embedding language," i.e., generate sequences of “soft tokens” optimized for semantic representation rather than human readability. The authors outline new loss functions and objectives in support of this goal, and show that the resulting models perform very competitively while generating only a handful of additional tokens. They also show that generating more tokens at inference time steadily improves embedding quality in a way analogous to chain-of-thought scaling in reasoning LLMs. KV-caching reduces the computational overhead of the generation process to within 1.1 times that of standard single-pass embedding models. This approach represents a new paradigm for representation learning, complementary to encoder-only and single-pass approaches.

Towards Improved Sentence Representations using Token Graphs frames the problem of generating embeddings for sentences from token-level representations as a relational learning problem rather than a compression problem. Instead of pooling tokens, it uses a supplementary neural network that processes a dynamically constructed graph made from output token similarities. This added network is compact, with very few trainable parameters, and can be implemented without doing any additional training on the main language model. The result is competitive with current frontier models.

This approach can be dropped into any language model at a very reasonable additional training cost, giving it immediate practical significance. Furthermore, the resulting models hold up well in the presence of noise, a known problem, especially for long-context models.
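As a rough picture of the token-graph idea, the sketch below links tokens whose output vectors are similar, runs one untrained neighbor-averaging step over that graph, and reads out a sentence vector. Everything here is an assumption made for illustration: the cosine threshold, the averaging step, and the mean readout stand in for the paper's small trained network, whose actual design is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_token_graph(tokens, threshold=0.5):
    """Adjacency list linking tokens whose similarity exceeds `threshold`."""
    n = len(tokens)
    return [[j for j in range(n)
             if j != i and cosine(tokens[i], tokens[j]) > threshold]
            for i in range(n)]


def message_pass(tokens, graph):
    """One untrained step: replace each token by the mean of itself and its neighbors."""
    updated = []
    for i, vec in enumerate(tokens):
        group = [vec] + [tokens[j] for j in graph[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


def sentence_embedding(tokens, threshold=0.5):
    """Build the graph, propagate once, then average the updated tokens."""
    graph = build_token_graph(tokens, threshold)
    updated = message_pass(tokens, graph)
    return [sum(col) / len(updated) for col in zip(*updated)]


# Toy 2-dimensional token outputs from a frozen language model (hypothetical values).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
graph = build_token_graph(tokens)
embedding = sentence_embedding(tokens)
```

The language model's weights are never touched: only the small component operating on the graph would be trained.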

Sparse and ultra-efficient embeddings

LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference introduces an asymmetric dual-encoder architecture for embeddings-based retrieval in which the query encoder is much smaller and faster than the document encoder. The key insight is that while document embeddings benefit from the modeling power of a large language model, query embeddings are much less demanding. The authors propose learning per-token query embeddings during training. At query time, those embeddings are looked up and averaged to produce a full query embedding. Documents must still be encoded at storage time using a potentially large encoder, but there is no need to invoke an embedding model at query time at all. The result retains approximately 95% of the performance of the query encoder it replaced. This has immediate implications for computationally constrained, time-sensitive, or resource-efficient text retrieval systems.
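The query-time path described above can be sketched in a few lines: look up a cached vector per query token, average them, and score documents by dot product. The token IDs, table contents, and document vectors below are made up for illustration, and the dot-product scoring is an assumption of this sketch.

```python
def embed_query(token_ids, token_table):
    """Average cached per-token vectors; no model forward pass is involved."""
    vectors = [token_table[t] for t in token_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical lookup table, learned during training and cached afterwards.
token_table = {101: [1.0, 0.0], 102: [0.0, 1.0], 103: [1.0, 1.0]}

# Document embeddings, produced offline by the large document encoder.
doc_embeddings = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9]}

query = embed_query([101, 103], token_table)
ranked = sorted(doc_embeddings,
                key=lambda d: dot(query, doc_embeddings[d]),
                reverse=True)
print(query, ranked)
```

All the expensive computation happens at indexing time, which is why query latency drops so sharply.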

CSRv2: Unlocking Ultra-Sparse Embeddings addresses the computational cost of embeddings-based retrieval using dense, high-dimensional vectors. It builds on Contrastive Sparse Representation (CSR), which maps dense vectors into a much higher-dimensional space where only a few vector entries are non-zero, so that search can use highly efficient sparse-vector techniques like inverted indexes.

CSR approaches tend to break down when the number of dimensions with non-zero values becomes very low. This paper addresses this problem with an innovative training approach that makes ultra-sparse representations viable, opening up the possibility of much faster, less computationally demanding retrieval without loss of accuracy.
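To show why sparsity makes search cheap, here is a minimal sketch that keeps only the k largest entries of each vector and scores queries through an inverted index. It assumes the vectors have already been projected into the high-dimensional space. The learned projection and the training method from the paper are not shown, and the toy vectors are invented.

```python
from collections import defaultdict


def top_k_sparsify(vector, k):
    """Keep the k largest positive entries as a {dimension: value} mapping."""
    top = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)[:k]
    return {i: vector[i] for i in top if vector[i] > 0}


def build_inverted_index(sparse_docs):
    """Map each active dimension to the documents that use it."""
    index = defaultdict(list)
    for doc_id, sparse in sparse_docs.items():
        for dim, value in sparse.items():
            index[dim].append((doc_id, value))
    return index


def search(index, sparse_query):
    """Score only documents that share an active dimension with the query."""
    scores = defaultdict(float)
    for dim, q_value in sparse_query.items():
        for doc_id, d_value in index.get(dim, []):
            scores[doc_id] += q_value * d_value
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy vectors standing in for already-projected embeddings.
docs = {
    "doc_a": [0.9, 0.0, 0.1, 0.7, 0.0, 0.2],
    "doc_b": [0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
}
sparse_docs = {doc_id: top_k_sparsify(vec, k=2) for doc_id, vec in docs.items()}
index = build_inverted_index(sparse_docs)

sparse_query = top_k_sparsify([0.8, 0.1, 0.0, 0.5, 0.0, 0.0], k=2)
results = search(index, sparse_query)
```

With k=2, the query touches only two posting lists, and `doc_b` is never scored because it shares no active dimension with the query. The smaller k gets, the less work each query does, which is why keeping accuracy at very low k matters.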

Multi-step and multimodal retrieval

Q-RAG: Long-Context Multi-Step Retrieval via Value-Based Embedder Training frames the problem of multi-step retrieval-augmented generation (RAG) in terms of optimizing the embeddings used in RAG search. RAG systems are typically based on a single retrieval step: Input to an LLM becomes a query to a vector store, and a selection of the results are presented to the LLM as a basis for composing a response. However, agentic approaches that involve multi-step interactions between the LLM and vector store can improve RAG performance significantly, especially for large input contexts that might contain millions of tokens. This paper seeks to optimize the embedding model used for retrieval to better support this usage scenario with Reinforcement Learning with Verifiable Rewards (RLVR).

This paper is one of the more elegant intersections of two of the conference's biggest themes — RLVR and retrieval — and it gives a glimpse of what retrieval looks like when it has to operate inside an agentic loop, not just before one.
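To picture retrieval inside a loop, here is a toy sketch in which each step picks the chunk that scores highest against a running state, then folds that chunk into the state before the next step. It illustrates only the multi-step control flow. The additive state update and the hand-written vectors are assumptions, and the reinforcement-learning training of the embedder described in the paper is not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def multi_step_retrieve(query_vec, chunk_vecs, steps):
    """Greedily pick one chunk per step, conditioning on what was already retrieved."""
    state = list(query_vec)
    selected = []
    for _ in range(steps):
        candidates = [c for c in chunk_vecs if c not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda c: dot(state, chunk_vecs[c]))
        selected.append(best)
        # Toy state update: add the retrieved chunk's vector to the state.
        state = [s + x for s, x in zip(state, chunk_vecs[best])]
    return selected


# Toy dimensions: [person, country, capital, other]
chunks = {
    "born_in": [1.0, 1.0, 0.0, 0.0],     # links a person to a country
    "capital_of": [0.0, 1.0, 1.0, 0.0],  # links a country to its capital
    "unrelated": [0.0, 0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0, 0.0]  # the question mentions only the person

selected = multi_step_retrieve(query, chunks, steps=2)
print(selected)
```

A single retrieval step cannot tell `capital_of` from `unrelated`, since both score zero against the original query. The second hop becomes reachable only after the first retrieved chunk updates the state.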

Foundations and evaluation

HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks undertakes the unusual task of systematically measuring human performance on the Massive Text Embedding Benchmark (MTEB), the most widely used benchmark for embeddings-based information retrieval. Using 16 datasets in 5 languages, the authors find that average human retrieval accuracy is 77.6%, while the best embedding models currently score over 80%. However, this performance gap is uneven. Models may outperform humans on standard tasks but fall apart when faced with low-resource languages, where human intuition still holds a significant lead.

This paper also shows that "superhuman" scores on low-agreement tasks are mostly artifacts of fitting noise, not genuine capability. This underlines the problem of our current suite of embedding benchmarks: New models are not improving benchmark performance very much. To make progress, we need new, harder challenges and a total rethink of how we evaluate models.

Training dynamics for foundation models

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining identifies a significant but underexplored problem in AI model training. With large training sets, models can forget what they learned earlier as they are presented with more data. Curriculum-based pretraining, which sorts data from low to high quality, should help, but in practice the results have been disappointing. The reason, the authors argue, is that the model encounters the highest-quality data late in the training schedule, when the learning rate is at its lowest, so that data's contribution to the weight updates is greatly reduced. They confirm that hypothesis empirically by showing that curriculum training significantly beats random shuffling if training uses a constant learning rate.

They propose two simple strategies to fix this: Let the learning rate decay more slowly, or replace learning rate decay with weight averaging over the last several checkpoints. Combining the two yields a 1.64% average benchmark improvement over standard practices with no additional data refinement. The broader message, that data composition and optimization schedule need to be co-designed, applies well beyond pretraining and is a useful frame for embedding training too.
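A small numeric sketch makes both halves of the argument tangible: how little learning rate is left late in a decaying schedule, and what averaging final checkpoints looks like. The cosine schedule, the peak rate, and the scalar "parameters" are assumptions chosen for illustration, not the paper's setup.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from `peak_lr` at step 0 to `min_lr` at `total_steps`."""
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def average_checkpoints(checkpoints):
    """Average each parameter across checkpoints that share the same keys."""
    n = len(checkpoints)
    return {name: sum(c[name] for c in checkpoints) / n for name in checkpoints[0]}


# Learning rate at the start vs. 90% of the way through training.
early = cosine_lr(0, 1000, peak_lr=3e-4)
late = cosine_lr(900, 1000, peak_lr=3e-4)
print(f"late/early learning-rate ratio: {late / early:.3f}")

# Toy checkpoints with scalar parameters standing in for weight tensors.
final_checkpoints = [
    {"w": 1.0, "b": 0.0},
    {"w": 2.0, "b": 1.0},
    {"w": 3.0, "b": 2.0},
]
averaged = average_checkpoints(final_checkpoints)
```

Under this schedule, data seen at the 90% mark is applied with only a few percent of the peak learning rate. Averaging checkpoints trained at a higher rate smooths the final weights without shrinking the updates from late data.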

What ICLR 2026 means for retrieval and embedding research

Science has always been conducted through print and publication, but in-person conferences are still the only way to put people together in a room. Over five days, we met a steady stream of researchers from very different backgrounds — academia and industry, large labs and small startups, half a dozen countries — and conversations ranged from research trends to philosophical questions that have haunted AI from the beginning. Are LLMs really reasoning, or are they doing something more like very high-dimensional memorization with interpolation? Where exactly is the line, and does it matter for what we can build on top of them?

These conversations rarely produce answers, but they sharpen the questions, which is most of what good research is.

For the information retrieval work we do at Jina by Elastic, the future looks bright. Retrieval, long relegated to merely applied research, is increasingly recognized as the engine for keeping language models grounded. Better encoders, better embedding training paradigms, sparser representations, and retrieval that operates at the core of reasoning loops – these things matter to us all. What we saw and heard at ICLR 2026 convinces us that this is where a meaningful share of the next round of progress will come from.

We're already looking forward to seeing where the field is next year.
