These days, if you’ve worked on semantic search or context engineering, you’ve probably dealt with a lot of chunks. If you’re not familiar with the term, a chunk is a small, meaningful piece of content extracted from a larger document. This blog provides a great foundational overview of chunking, why it’s important, and various chunking strategies.
For this blog, we want to focus on one specific problem within chunking: defining the best context to send to a large language model (LLM) or other model. Models have a limited number of tokens they can accept as context, but even within that limit, sending in large amounts of content can degrade relevance through factors such as context rot or the “lost in the middle” problem, where important information buried in large blocks of text is overlooked.
This led to the question: How can we make this better?
Reranking in retrievers
We started by looking at retrievers, specifically the text_similarity_reranker retriever. Many cross-encoder rerankers do not perform well on long documents because they truncate content to the model’s token window and discard the rest. This can actually degrade search relevance if the most relevant part of the document is cut off before it ever reaches the reranker!
We decided to address this by introducing a chunk_rescorer to the text_similarity_reranker retriever. When specified, rather than sending the entire document to the reranker, we chunk the document first and evaluate each chunk against the reranking inference text. We do this by indexing each chunk into a temporary in-memory Lucene index and performing a BM25 text search over those chunks. The best-matching chunks are then passed to the reranker.
The chunk rescorer is simple to use with a small update to the API call:
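As a rough sketch, the request could look something like the following; the index name, the query, and the chunk_rescorer options shown here (such as the number of chunks to keep) are illustrative, so check the text_similarity_reranker documentation for the exact shape:
```
POST my-index/_search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": { "match": { "content": "how does chunk rescoring work?" } }
        }
      },
      "field": "content",
      "inference_id": ".rerank-v1-elasticsearch",
      "inference_text": "how does chunk rescoring work?",
      "rank_window_size": 32,
      "chunk_rescorer": {
        "size": 3
      }
    }
  }
}
```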
When we evaluated the chunk rescorer, we found a significant improvement for many truncating models, including the Elastic Reranker and Cohere's rerank-english-v3.0 model. However, when we evaluated against jina-reranker-v2-base-multilingual, the results were not as impressive because Jina already addresses this long-document problem internally.
We performed evaluations using the Multilingual Long-Document Retrieval (MLDR) English dataset. This dataset contains very long articles that trigger the document truncation issue in many reranking models. The following table shows our evaluation results with BM25 text search and a rank_window_size of 32:
| Reranker model | NDCG@10 | NDCG@10 with chunk rescoring |
|---|---|---|
| jina-reranker-v2-base-multilingual | 0.771145 | 0.764488 |
| Cohere rerank-english-v3.0 | 0.592588 | 0.707842 |
| .rerank-v1-elasticsearch | 0.478121 | 0.751994 |
It’s worth noting that the raw BM25 results without reranking had a Normalized Discounted Cumulative Gain (NDCG) score, or relevance score, close to 0.64. (Find additional background in this paper.) This means that for rerankers that perform truncation, reranked results for long documents were actually worse than without reranking. Note that this only applies for long documents; shorter documents that fit into the token window would not be affected by this long document problem.
Of the rerankers we evaluated, Jina was the only reranker to perform well against long documents out of the box, thanks to its sliding window approach.
We saw better baseline performance but similar overall difference in results when using semantic_text fields with Elastic Learned Sparse EncodeR (ELSER).
We felt the results for truncating models were promising enough to release the chunk rescorer as an opt-in feature for models that will benefit from the additional relevance, but we recommend evaluating against specific rerankers before implementing this in production.
ES|QL
The real power of chunk extraction, however, lies in the Elasticsearch Query Language (ES|QL). We wanted chunks and snippets to be first-class citizens in ES|QL so they could be easily extracted and repurposed for reranking, sending into LLM context, or other purposes.
We started by introducing the CHUNK function in Elasticsearch version 9.2:
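The exact signature isn’t reproduced here, but as a rough sketch (the index and field names are illustrative, and CHUNK is shown in its simplest form):
```esql
FROM articles
| WHERE MATCH(content, "context engineering")
| EVAL chunks = CHUNK(content)   // produces a multivalued column of chunks
| KEEP title, chunks
| LIMIT 5
```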
CHUNK is a very simple primitive that takes some string content (a text field, a semantic_text field, or any other row content that is a string) and chunks it. You can view and interact with these chunks, and you can also explore using different chunking settings:
You can then combine CHUNK with existing primitives, like MV_SLICE and MV_EXPAND, to format the way chunks are represented in your row output:
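For example, a sketch of passing chunking settings might look like this; the option names shown (strategy and max_chunk_size) are assumptions for illustration, so consult the CHUNK documentation for the settings it actually supports:
```esql
FROM articles
| EVAL chunks = CHUNK(content, { "strategy": "sentence", "max_chunk_size": 100 })  // option names are illustrative
| KEEP title, chunks
| LIMIT 5
```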
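For instance, a sketch like the following keeps only the first few chunks of each document and expands them into individual rows (index and field names are illustrative):
```esql
FROM articles
| EVAL chunks = CHUNK(content)
| EVAL first_chunks = MV_SLICE(chunks, 0, 2)  // keep only the first three chunks
| MV_EXPAND first_chunks                      // one row per chunk
| KEEP title, first_chunks
| LIMIT 20
```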
This is great, but what we really wanted was to get the top matching snippets for a query, so we also introduced TOP_SNIPPETS in Elasticsearch version 9.3:
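In its simplest form, a TOP_SNIPPETS query could look roughly like this (the exact signature may differ; index and field names are illustrative):
```esql
FROM articles
| WHERE MATCH(content, "how does chunk rescoring improve relevance?")
| EVAL snippets = TOP_SNIPPETS(content, "how does chunk rescoring improve relevance?")
| KEEP title, snippets
| LIMIT 10
```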
We added support for controlling the number of snippets to return and their word size, using a sentence-based chunking strategy:
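As a sketch, with the caveat that whether these are positional arguments or named options is an assumption here:
```esql
FROM articles
| WHERE MATCH(content, "chunk rescoring")
| EVAL snippets = TOP_SNIPPETS(content, "chunk rescoring", 3, 150)  // ~3 snippets of roughly 150 words each
| KEEP title, snippets
| LIMIT 10
```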
This fits into the broader story of LLMs when you add in COMPLETION. Here is an example of how we envision TOP_SNIPPETS integrating with LLMs:
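A sketch of that flow is below; the semantic_text field name, the TOP_SNIPPETS arguments, and the my-llm-endpoint inference endpoint are all illustrative, and the COMPLETION syntax may vary slightly by version:
```esql
FROM articles METADATA _score
| WHERE MATCH(semantic_content, "what causes context rot in LLMs?")  // semantic search on a semantic_text field
| SORT _score DESC
| LIMIT 5
| EVAL snippets = TOP_SNIPPETS(content, "what causes context rot in LLMs?", 3, 150)
| EVAL prompt = CONCAT(
    "Answer the question using only the context below.\n",
    "Question: what causes context rot in LLMs?\n",
    "Context:\n", MV_CONCAT(snippets, "\n"))
| COMPLETION answer = prompt WITH { "inference_id": "my-llm-endpoint" }  // hypothetical completion endpoint
| KEEP title, answer
```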
In this example, we’re performing a semantic search, but for each document, we’re identifying the top snippets from that document. We send only the highly relevant snippets into the completion command, rather than the entire document. This is a simple example, but you could also use reranking here, and in the future, when multiple forks are available, hybrid search will be supported in the same format.
We can also utilize snippets in the newest version of RERANK:
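One way this could look is reranking on extracted snippets rather than the full document body; the RERANK syntax shown, and whether snippets need to be concatenated into a single column first, are assumptions here:
```esql
FROM articles METADATA _score
| WHERE MATCH(content, "chunk rescoring")
| SORT _score DESC
| LIMIT 20
| EVAL snippets = TOP_SNIPPETS(content, "chunk rescoring", 3, 150)
| EVAL snippet_text = MV_CONCAT(snippets, " ")
| RERANK "chunk rescoring" ON snippet_text WITH { "inference_id": ".rerank-v1-elasticsearch" }
| LIMIT 10
| KEEP title, snippet_text
```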
What we’re thinking about next
The story isn’t over for chunking and snippet extraction; in fact, it’s only getting started.
We’re looking at how to best integrate existing semantic_text chunks out of the box into strategies using chunking and snippet extraction. We’re also exploring what other features we need to make snippet extraction a compelling feature to use in products such as Elastic Agent Builder.
Overall, we’re excited to share these tools and look forward to your feedback as we evolve our strategies for getting the best context for LLMs!




