Intelligent RAG, Fetch Surrounding Chunks

In Retrieval-Augmented Generation (RAG), one persistent challenge is finding the right amount of data to feed into a Large Language Model (LLM). Too little data results in incomplete or inaccurate responses, while too much data dilutes the relevant context and leads to vague answers. This delicate balance inspired me to develop a notebook focusing on intelligent chunking and on using Elasticsearch as a vector database.

The Motivation

The primary motivation behind building this notebook was to demonstrate a refined approach to RAG by addressing the challenge of data chunking. Traditional methods often fail to adjust the amount of data fed to the LLM, either overwhelming the model with too much context or starving it of the context it needs. This notebook aims to strike the right balance, providing just enough information for the LLM to generate precise and contextually relevant responses. That said, there is no one-size-fits-all solution.

This method works especially well with books and similar texts where content flows within longer sections or chapters. However, it may require adaptation for texts structured into shorter, distinct sections, such as research papers or articles, where each segment might cover a different topic. In such cases, additional strategies may be necessary to effectively chunk and retrieve related content.

The Methodology

Fetch Surrounding Chunks

The core idea is to partition the source text into manageable chunks, ensuring each chunk contains just the right amount of information. For this demonstration, I used text from "Harry Potter and the Sorcerer's Stone." The text was partitioned into chapters, and each chapter was further divided into smaller chunks. These chunks, along with their dense and sparse (ELSER) vector representations, were indexed in the Elasticsearch vector database.


Assigning Numbers to Chunks

Each chunk within a chapter was assigned a sequential integer, allowing us to identify its position. When a matching chunk is found, the chapter number and chunk number are used to retrieve surrounding chunks, providing additional context for the LLM.
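The partitioning and numbering described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual code: the `Chunk` class, the function names, the fixed word-count splitting rule, and the choice to start chunk numbers at 0 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chapter: int  # 1-based chapter number (assumed convention)
    chunk: int    # sequential position within the chapter, starting at 0 (assumed)
    text: str


def chunk_chapter(chapter_number: int, chapter_text: str, max_words: int = 200) -> list[Chunk]:
    """Split one chapter into chunks of at most `max_words` words each."""
    words = chapter_text.split()
    chunks: list[Chunk] = []
    for start in range(0, len(words), max_words):
        text = " ".join(words[start:start + max_words])
        # The chunk number is simply the chunk's position in the chapter.
        chunks.append(Chunk(chapter_number, len(chunks), text))
    return chunks


def chunk_book(chapters: list[str], max_words: int = 200) -> list[Chunk]:
    """Chunk every chapter; numbering restarts at 0 for each chapter."""
    all_chunks: list[Chunk] = []
    for number, text in enumerate(chapters, start=1):
        all_chunks.extend(chunk_chapter(number, text, max_words))
    return all_chunks
```

Because the numbering restarts in every chapter, the pair (chapter, chunk) identifies a chunk's position, which is what makes the later neighbor lookup possible. A real pipeline would likely split on sentence or paragraph boundaries rather than on a raw word count.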

Vector Database in Elasticsearch

These chunks and their vector representations were ingested into an Elasticsearch Cloud instance. Elasticsearch's robust vector search capabilities make it ideal for hosting these chunks, allowing for efficient retrieval of the most relevant chunks based on the semantic content or text match of a user's query.
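An index holding these chunks needs fields for the text, the position numbers, and the two vector representations. The mapping below is a hedged sketch of what that could look like. The field names (`chapter`, `chunk_number`, `text`, `dense_embedding`, `sparse_embedding`), the 384 dimensions, and the cosine similarity are assumptions for illustration, not values taken from the notebook.

```python
def build_index_mapping(dense_dims: int = 384) -> dict:
    """Return an index body with position, text, and vector fields.

    All field names and the default dimension count are hypothetical.
    """
    return {
        "mappings": {
            "properties": {
                # Position of the chunk, used later to fetch its neighbors.
                "chapter": {"type": "integer"},
                "chunk_number": {"type": "integer"},
                # Raw chunk text, used for lexical (full-text) matching.
                "text": {"type": "text"},
                # Dense embedding, indexed for kNN search.
                "dense_embedding": {
                    "type": "dense_vector",
                    "dims": dense_dims,
                    "index": True,
                    "similarity": "cosine",
                },
                # Sparse token weights such as those produced by ELSER.
                "sparse_embedding": {"type": "sparse_vector"},
            }
        }
    }
```

The `sparse_vector` field type assumes a recent Elasticsearch 8.x release; older deployments typically stored ELSER output in a `rank_features` field instead. `dense_dims` must match the output size of whichever embedding model is used.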

To retrieve the relevant chunks, I employed a hybrid search strategy using dense vector comparisons, sparse vector comparisons, and text search in parallel. This multi-faceted approach ensures that the search results are both semantically rich and contextually accurate. A query is issued to find the matched chunk, which returns the chunk number and chapter. Surrounding chunks for that chapter are then fetched based on the matched chunk.
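The two requests described above, one hybrid search to find the matched chunk and one lookup to fetch its neighbors, could be built roughly as follows. This is a sketch under stated assumptions: the field names are hypothetical, the model IDs are placeholders supplied by the caller, and the notebook may combine the three retrieval methods differently (for example with rank fusion). Here a top-level `knn` clause is combined with a `bool` query containing `match` and `text_expansion` clauses, in which case Elasticsearch sums the scores.

```python
def build_hybrid_query(
    query_text: str,
    dense_model_id: str,   # placeholder: ID of a deployed text-embedding model
    sparse_model_id: str,  # placeholder: ID of a deployed ELSER model
    size: int = 1,
    num_candidates: int = 50,
) -> dict:
    """Search body combining dense kNN, sparse expansion, and text match."""
    return {
        "size": size,
        "_source": ["chapter", "chunk_number", "text"],
        "query": {
            "bool": {
                "should": [
                    # Lexical match on the chunk text.
                    {"match": {"text": query_text}},
                    # Sparse (ELSER) match; the query text is expanded server-side.
                    {
                        "text_expansion": {
                            "sparse_embedding": {
                                "model_id": sparse_model_id,
                                "model_text": query_text,
                            }
                        }
                    },
                ]
            }
        },
        # Dense vector match; the query text is embedded server-side.
        "knn": {
            "field": "dense_embedding",
            "k": size,
            "num_candidates": max(num_candidates, size),
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": dense_model_id,
                    "model_text": query_text,
                }
            },
        },
    }


def build_surrounding_chunks_query(chapter: int, chunk_numbers: list[int]) -> dict:
    """Search body that fetches specific chunks of one chapter in reading order."""
    return {
        "size": len(chunk_numbers),
        "query": {
            "bool": {
                "filter": [
                    {"term": {"chapter": chapter}},
                    {"terms": {"chunk_number": chunk_numbers}},
                ]
            }
        },
        "sort": [{"chunk_number": "asc"}],
    }
```

The first body returns the chapter and chunk number of the best match; those values then feed the second body. The `text_expansion` query is deprecated in newer Elasticsearch releases in favor of the `sparse_vector` query, so the clause may need adjusting for a given cluster version.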


The RAG Pattern

When a query is made, the search flow performs the following steps:

  1. Query Analysis: The user's query is translated into dense and sparse vectors to retrieve the most relevant chunks from the Elasticsearch index.
  2. Chunk Retrieval: Using the hybrid search strategy, the system retrieves the most relevant chunks, along with their chapter and chunk numbers.
  3. Contextual Expansion: For a matched chunk n, the adjacent chunks (n-1 and n+1) are also retrieved to provide a more comprehensive context. If the matched chunk is the last in its chapter, the system fetches n-1 and n-2 instead; if it is the first, it fetches n+1 and n+2.
  4. LLM Response: These intelligently selected chunks are then fed into the LLM, ensuring it receives the optimal amount of information to generate a precise and contextually relevant response.
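The contextual expansion rule in step 3 is easy to get wrong at chapter boundaries, so here is a small sketch of it. The function name is hypothetical, and it assumes chunk numbers start at 0 within each chapter and that the number of chunks in the chapter is known. Clamping for very short chapters (one or two chunks) is an addition of this example, not something the steps above specify.

```python
def surrounding_chunk_numbers(n: int, chunk_count: int) -> list[int]:
    """Return the matched chunk number plus up to two neighbors, in reading order.

    n           -- 0-based number of the matched chunk (assumed convention)
    chunk_count -- total number of chunks in the matched chunk's chapter
    """
    if chunk_count <= 0 or not 0 <= n < chunk_count:
        raise ValueError("n must satisfy 0 <= n < chunk_count")

    if n == 0:
        # First chunk of the chapter: take the two chunks that follow.
        candidates = [n, n + 1, n + 2]
    elif n == chunk_count - 1:
        # Last chunk of the chapter: take the two chunks that precede.
        candidates = [n - 2, n - 1, n]
    else:
        # Middle of the chapter: one chunk on each side.
        candidates = [n - 1, n, n + 1]

    # Drop numbers that fall outside chapters with fewer than three chunks.
    return [c for c in candidates if 0 <= c < chunk_count]
```

For a chapter of ten chunks, a match on chunk 5 yields chunks 4, 5, and 6, a match on chunk 0 yields 0, 1, and 2, and a match on chunk 9 yields 7, 8, and 9. Returning the numbers in reading order means the fetched text can be concatenated directly before it is passed to the LLM.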

Why This Matters

This approach addresses a critical aspect of RAG by optimizing the input data fed to LLMs. By leveraging intelligent chunking and hybrid semantic search, this method enhances the accuracy and relevance of the responses generated by LLMs. It showcases a pattern that can be widely applied in various applications within the RAG space, from customer support to content generation and beyond.


This notebook underscores the importance of intelligent data chunking in RAG and demonstrates how Elasticsearch can be used as a vector database to support it. By giving the LLM the matched chunk together with its immediate neighbors, rather than too little or too much text, this methodology paves the way for more accurate and contextually rich responses, enhancing the overall effectiveness of RAG systems.
