Chunking large documents via ingest pipelines plus nested vectors equals easy passage search

Learn how to chunk large documents using ingest pipelines and nested vectors in Elasticsearch, making passage-level vector search easy.

Vector search is a powerful way to search data based on meaning rather than on exact or fuzzy token matching. However, the text embedding models that power vector search can only process short passages of text, on the order of a few sentences, unlike BM25-based techniques that can work on arbitrarily large amounts of text. With Elasticsearch, combining large documents with vector search is now possible and straightforward.

How does it work at a high level?

The combination of Elasticsearch features (ingest pipelines, the flexibility of a script processor, and new support for nested documents with dense_vector fields) allows for a straightforward way to chunk large documents at ingest time into passages small enough for text embedding models to process, generating all the vectors needed to represent the full meaning of the large documents.

Ingest your document data as you normally would, but add to your ingest pipeline a script processor that breaks the large text field into an array of sentences or other kinds of chunks, followed by a foreach processor that runs an inference processor on each chunk. The index mappings define the array of chunks as a nested object with a dense_vector mapping as a subobject, which properly indexes each of the vectors and makes them searchable.

How to chunk large documents via ingest pipelines & nested vectors

Load a text embedding model

The first thing you will need is a model to create text embeddings from the chunks. You can use whatever model you would like, but this example will run end to end with the all-distilroberta-v1 model. With an Elastic Cloud cluster created or another Elasticsearch cluster ready, we can upload the text embedding model using the eland library.
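
As a sketch (the cluster endpoint and credentials below are placeholders, and the exact eland Python API can differ between versions; eland's eland_import_hub_model command line tool is an equivalent alternative), the model can be pulled from the Hugging Face hub and uploaded like this:

```python
# Sketch: upload sentence-transformers/all-distilroberta-v1 to Elasticsearch with eland.
# The cluster URL, credentials, and temporary directory are placeholders.
from pathlib import Path

from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "<password>"))

# Download the model from the Hugging Face hub and convert it to TorchScript
tm = TransformerModel(model_id="sentence-transformers/all-distilroberta-v1",
                      task_type="text_embedding")
tmp_dir = "models"
Path(tmp_dir).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_dir)

# Import the model into the cluster and start a deployment
ptm = PyTorchModel(es, tm.elasticsearch_model_id())
ptm.import_model(model_path=model_path, config_path=None,
                 vocab_path=vocab_path, config=config)
ptm.start()
```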

Mappings example

The next step is to prepare the mappings to handle the array of sentences and vector objects that the ingest pipeline will create. For this particular text embedding model the dimension is 384, and dot_product similarity will be used for nearest neighbor calculations:
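
Here is a sketch of those mappings using the Python client, reusing the es client created above. The index name chunker and the passages / vector.predicted_value field names are placeholders used throughout these examples (predicted_value is where the inference processor writes its text embedding output); marking passages as nested is what lets each chunk's vector be indexed and searched individually:

```python
# Sketch of the index mappings: passages is a nested object whose
# vector.predicted_value subfield holds a 384-dim dense_vector
# indexed with dot_product similarity.
es.indices.create(
    index="chunker",
    mappings={
        "dynamic": "true",
        "properties": {
            "title": {"type": "text"},
            "body_content": {"type": "text"},
            "passages": {
                "type": "nested",
                "properties": {
                    "text": {"type": "text"},
                    "vector": {
                        "properties": {
                            "predicted_value": {
                                "type": "dense_vector",
                                "index": True,
                                "dims": 384,
                                "similarity": "dot_product",
                            }
                        }
                    },
                },
            },
        },
    },
)
```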

Ingest pipeline examples

The last preparation step is to define an ingest pipeline to break up the body_content field into chunks of text stored in the passages field. The pipeline has two processors. The first, a script processor, breaks body_content into an array of sentences stored in passages via a regular expression. To dig deeper, read up on advanced regular expression features such as negative and positive lookbehind to understand how it tries to split on sentence boundaries, avoid splitting on Mr., Mrs., or Ms., and keep the punctuation with the sentence. The script also concatenates the sentence chunks back together as long as the total string length stays under the parameter passed to the script. The second, a foreach processor, runs the text embedding model on each sentence via an inference processor:
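
A sketch of such a pipeline follows, still using the Python client and the placeholder names from above. The Painless script is one illustrative way to split sentences and re-pack them up to a model_limit number of characters, and the model_id is the one eland generates for all-distilroberta-v1:

```python
# Sketch of the ingest pipeline: a script processor chunks body_content into the
# passages array, then a foreach processor embeds each chunk with an inference
# processor. Names and the splitting heuristic are illustrative, not prescriptive.
chunking_script = r"""
ctx['passages'] = new ArrayList();
if (ctx['body_content'] != null) {
  // Split on a space that follows ., ! or ?, but not after Mr./Mrs./Ms.,
  // so the punctuation stays attached to its sentence
  String[] sentences = /((?<!M(r|s|rs)\.)(?<=\.) |(?<=\!) |(?<=\?) )/.split(ctx['body_content']);
  int i = 0;
  while (i < sentences.length) {
    Map passage = ['text': sentences[i++]];
    // Re-pack consecutive sentences while the chunk stays under model_limit characters
    while (i < sentences.length && passage.text.length() + 1 + sentences[i].length() <= params.model_limit) {
      passage.text = passage.text + ' ' + sentences[i++];
    }
    ctx['passages'].add(passage);
  }
}
"""

es.ingest.put_pipeline(
    id="chunker",
    processors=[
        {
            "script": {
                "description": "Chunk body_content into sentence-based passages",
                "lang": "painless",
                "source": chunking_script,
                "params": {"model_limit": 400},
            }
        },
        {
            "foreach": {
                "field": "passages",
                "processor": {
                    "inference": {
                        "model_id": "sentence-transformers__all-distilroberta-v1",
                        "target_field": "_ingest._value.vector",
                        "field_map": {"_ingest._value.text": "text_field"},
                    }
                },
            }
        },
    ],
)
```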

Add some documents

Now we can add documents with large amounts of text in body_content and have them automatically chunked, with each chunk embedded into a vector by the model:
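
For example (the document text here is a placeholder, and the index and pipeline names are the ones assumed above), indexing through the chunker pipeline populates passages with one text-and-vector object per chunk:

```python
# Index a document with a long body_content through the "chunker" pipeline.
# The pipeline fills in passages[].text and passages[].vector.predicted_value.
doc = {
    "title": "A placeholder document",
    "body_content": (
        "This is the first sentence of a much longer body of text. "
        "It is followed by many more sentences. "
        "Each group of sentences becomes a passage with its own vector."
    ),
}
es.index(index="chunker", pipeline="chunker", document=doc)
```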

Search those documents

To search the data and return the chunk that best matched the query, use inner_hits with the knn clause so that only the best-matching chunk of each document is returned in the hits output of the query:
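
Here is a sketch of such a query with the Python client. query_vector_builder embeds the query text with the same deployed model, and inner_hits with size 1 returns only the single best-matching passage per document; the field and model names are the placeholders used above:

```python
# kNN search over the nested passage vectors; inner_hits surfaces the single
# best-matching passage per document instead of the full body_content.
resp = es.search(
    index="chunker",
    knn={
        "field": "passages.vector.predicted_value",
        "k": 2,
        "num_candidates": 50,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-distilroberta-v1",
                "model_text": "your natural language question goes here",
            }
        },
        "inner_hits": {
            "size": 1,
            "_source": False,
            "fields": ["passages.text"],
        },
    },
    source=False,
    fields=["title"],
)
```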

This returns the best-matching documents along with the relevant portion of each larger document's text.
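
As a sketch of reading that response (assuming the request above, where the nested inner hits are keyed by the passages path and fields values come back as arrays), the title and best passage can be pulled out like this:

```python
# Each top-level hit carries its best passage under inner_hits["passages"];
# fields values are always arrays, hence the [0] indexing.
for hit in resp["hits"]["hits"]:
    title = hit["fields"]["title"][0]
    best_passage = (
        hit["inner_hits"]["passages"]["hits"]["hits"][0]
           ["fields"]["passages"][0]["text"][0]
    )
    print(f"{title} (score {hit['_score']}): {best_passage}")
```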

Review

The approach used here shows the power of leveraging the different capabilities of Elasticsearch to solve a larger problem.

Ingest pipelines allow you to preprocess your documents before indexing, and while there are many processors for specific, targeted tasks, sometimes you need the power of a scripting language to do things like break text into an array of sentences. Because you can access the document before it is indexed, you can reshape the data in nearly any way you can imagine, as long as all the information is within the document itself. The foreach processor wraps a step that may run zero to N times without knowing in advance how many times it needs to execute; here it runs the inference processor over however many sentences we extract, producing the vectors.

The index mappings handle the array of text-and-vector objects that did not exist in the original document, using a nested object type that indexes the data in a way that lets the document be searched properly.

Using knn with nested vector support allows inner_hits to present the best-scoring portion of the document, which substitutes for what would usually be done with highlighting in a BM25 query.

Conclusion

Hopefully this shows Elasticsearch doing what it does best: bring your data, and Elasticsearch will make it searchable for you.

Take your skills to the next level and learn how to implement a recursive chunking strategy by watching this video.

Try out vector search for yourself using this self-paced hands-on learning for Search AI. You can start a free cloud trial or try Elastic on your local machine now.
