What is full-text search (FTS)? A technical and strategic guide

How full-text search works: The inverted index explained

Full-text search gets its speed from a data structure called an inverted index. Think of it like the index at the back of a textbook. Instead of reading every page to find a topic, you look up a keyword and instantly get a precompiled list of every document where it appears.

This structure has two core components:

  • The dictionary: An alphabetized list of every unique term identified across the entire document collection.
  • The postings list: A detailed record for each term containing the document IDs where the word appears, how frequently it appears in each document, and the exact word positions, which are critical for proximity queries (finding words near each other).

By jumping directly to the relevant entries in the postings list, search engines like Elasticsearch provide near-instant retrieval across petabytes of information.
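The two components can be sketched in a few lines of Python. This is a toy illustration assuming simple whitespace tokenization; `build_index` is an illustrative name, not a real library call:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
}
index = build_index(docs)

# The "dictionary" is the set of keys; each value is a postings list
# recording which documents contain the term and at which positions.
print(sorted(index["brown"]))  # documents containing "brown" → [1, 2]
print(index["brown"][1])       # positions of "brown" in doc 1 → [2]
```

Because a lookup jumps straight to the term's postings list, query time scales with the number of matching documents rather than the size of the whole collection.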


The analysis pipeline: Techniques and methods of text processing

Before a document enters the inverted index, it passes through an analysis pipeline that breaks raw text down into searchable "tokens." This process ensures the engine can match different variations of the same word, significantly increasing the recall of your search results.

Key stages include:

  • Tokenization: Breaking raw text into individual units or "tokens" (usually words).
  • Normalization: Converting all tokens to a standard format by lowercasing and stripping punctuation, so "Search" and "search" are treated as the same term.
  • Stemming and lemmatization: Reducing words to their root form so "running" and "ran" both become "run," catching different tenses and pluralizations.
  • Stop word elimination: Filtering out common filler words like "the," "is," and "a" that don't help distinguish one document from another, keeping the index lean and fast.
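The four stages above can be strung together in a short sketch. The suffix-stripping stemmer here is deliberately naive (real analyzers use Porter stemming or dictionary-based lemmatization), and the stop word list is a tiny sample:

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and"}

def naive_stem(token):
    # Toy suffix stripping, nothing like a production stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            if len(token) > 2 and token[-1] == token[-2]:
                token = token[:-1]  # "runn" -> "run"
            break
    return token

def analyze(text):
    tokens = re.findall(r"[a-zA-Z]+", text)            # tokenization
    tokens = [t.lower() for t in tokens]               # normalization
    tokens = [naive_stem(t) for t in tokens]           # stemming
    return [t for t in tokens if t not in STOP_WORDS]  # stop word elimination

print(analyze("The dog is Running and jumped"))  # → ['dog', 'run', 'jump']
```

Only the surviving tokens ("dog", "run", "jump") are written to the inverted index, which is why a query for "runs" can still match a document that says "running."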

To learn more about implementing these stages, view our full-text filter tutorial.


Advanced query types: Fuzzy, proximity, and Boolean

Modern search goes well beyond simple keyword matching. Advanced query types let developers fine-tune exactly how strict or flexible the search parameters should be:

  • Fuzzy search: Uses Levenshtein Distance algorithms to find matches that are similar but not identical to the search term, effectively catching human misspellings.
  • Proximity search: Finds terms within a specific distance of each other (e.g., "Abraham" within three words of "Lincoln").
  • Boolean logic: Employs AND, OR, and NOT operators to build precise filters — for example, surfacing only documents that contain "Cloud" and "Security" but NOT "AWS".
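Fuzzy search rests on edit distance. A textbook dynamic-programming Levenshtein implementation (a sketch of the underlying algorithm, not how any particular engine computes it internally) looks like this:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert/delete/substitute)
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# A fuzzy query allowing 1 edit would match this dropped-letter misspelling:
print(levenshtein("search", "serch"))    # → 1
print(levenshtein("kitten", "sitting"))  # → 3
```

A fuzzy query with a maximum edit distance of 1 or 2 matches any indexed term within that distance of the typed term, which is how engines forgive typos without any language-specific logic.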

Explore more in our guide to search approaches.


Relevancy scoring: Understanding the BM25 algorithm

In full-text search, not all matches are equal. Relevancy scoring assigns a numerical value (a _score) to each result to determine the order you see them in.

The industry standard for this is Okapi BM25. What makes it smarter than older models is "term frequency saturation": a document where a word appears 100 times doesn't unfairly outweigh one where the word appears just 10 times. It also adjusts for document length, so a short, focused article isn't buried under a sprawling report on the same topic.

Three factors drive the score:

  • Term frequency (TF): The more often a term appears in a document, the more relevant it is, but with diminishing returns. BM25 recognizes that a word appearing 50 times isn't 50x more important than one appearing once.
  • Inverse document frequency (IDF): Rare terms carry more weight. A match on a distinctive term like "mitochondria" tells you more about a document than a match on a common word like "problem."
  • Field-length normalization: A match in a short title is generally more meaningful than the same match buried deep in a 50-page PDF.
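The three factors combine into a single formula. Here is a sketch of the per-term BM25 score, using the commonly cited defaults k1 = 1.2 and b = 0.75; note that the exact IDF variant differs slightly between implementations:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Contribution of one term to one document's BM25 score."""
    # IDF: rarer terms (low doc_freq across the collection) weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # TF with saturation, normalized by document length relative to average.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Saturation: 10x the occurrences yields nowhere near 10x the score.
s10 = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
s100 = bm25_term_score(tf=100, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)

# Length normalization: the same tf scores higher in a shorter document.
short = bm25_term_score(tf=5, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=50)
long_ = bm25_term_score(tf=5, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=50)
```

A document's final `_score` for a multi-word query is simply the sum of these per-term contributions.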

Deep dive: Want to master the math behind the results? Read: What is search relevance?


Full-text search vs. vector search: Use cases for hybrid discovery

Lexical (full-text) search is the gold standard for exact keyword matching and specialized terminology. Vector search takes a different approach, using AI models to understand the conceptual intent behind a query. The most powerful modern systems use hybrid search, combining both methods to get keyword precision and semantic depth.

| Feature | Full-text search (lexical) | Vector search (dense) | Hybrid search |
|---|---|---|---|
| Primary logic | Keyword/token matching | Mathematical "distance" in vector space | Combined score (RRF) |
| Strengths | Exact names, SKU numbers, typos | Conceptual meaning, image/audio search | Best of both worlds |
| Complexity | Lower/established | Higher/requires ML models | Balanced for modern AI apps |
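Hybrid systems commonly merge the two result lists with Reciprocal Rank Fusion (RRF). A minimal sketch, using the commonly cited constant k = 60 and hypothetical document IDs:

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked result lists via Reciprocal Rank Fusion.

    rankings: list of ranked doc-ID lists, best hit first.
    Each doc accumulates 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc3", "doc1", "doc7"]  # hypothetical BM25 ranking
vector = ["doc1", "doc9", "doc3"]   # hypothetical vector-similarity ranking
print(rrf_fuse([lexical, vector]))  # → ['doc1', 'doc3', 'doc9', 'doc7']
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and vector similarities live on incomparable scales: documents ranked well by both methods (like doc1 above) rise to the top.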

Discover the future of discovery in our AI search and vector database guide.


Why full-text search is critical for AI

As organizations adopt generative AI and retrieval augmented generation (RAG), the quality of what a large language model (LLM) produces depends entirely on the context fed into it.

Full-text search is a foundational piece of the RAG pipeline. By using FTS to pull the most relevant "ground truth" data from your internal documents, you give your AI models something solid to work with, ensuring they produce accurate responses grounded in your actual data.
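In practice the pattern is simple: retrieve the top-ranked passages with FTS, then stuff them into the model's prompt as context. A minimal sketch, where `retrieve` stands in for any real search call (e.g. an Elasticsearch match query) and the corpus is a toy placeholder:

```python
def build_rag_prompt(question, retrieve, top_k=3):
    """Assemble an LLM prompt from full-text search hits (RAG sketch).

    `retrieve` is assumed to return passages ranked by relevance;
    here it is a placeholder for a real search backend.
    """
    passages = retrieve(question)[:top_k]
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Toy retriever standing in for a real full-text search engine:
corpus = {"refund policy": "Refunds are issued within 30 days."}
prompt = build_rag_prompt(
    "What is the refund policy?",
    lambda q: [text for key, text in corpus.items() if key in q.lower()],
)
print(prompt)
```

The grounding instruction ("using only the context below") plus the retrieved passages is what keeps the model's answer anchored to your actual data instead of its training memory.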


The engineering behind FTS: The inverted index

The reason Elasticsearch is so much faster than a standard relational database for text retrieval comes down to one structural difference: the inverted index.

A standard database uses a forward index, mapping a document ID to its content. Finding a word means scanning every row. An inverted index flips that relationship entirely, mapping every unique term to a postings list containing the document IDs, frequency, and exact positions where that term appears.

The analysis pipeline

Before a document enters the inverted index, it goes through a transformation process called analysis, which consists of:

  • Character filters: Stripping HTML tags or converting special characters
  • Tokenization: Breaking strings into individual terms (e.g., "The quick brown fox" becomes ["the", "quick", "brown", "fox"])
  • Token filters: Applying linguistic logic like stemming ("running" → "run"), lowercasing, and stop word removal (dropping "and" or "the" to save index space)