Glossary

This glossary defines essential terms and concepts to help you understand Elasticsearch and related technologies.

Data Augmentation

Artificially expanding a training dataset by creating modified versions of existing examples, such as paraphrasing sentences, back-translating them, or adding controlled noise. Augmentation increases training diversity without requiring additional labeled data, helping models generalize to unseen inputs. In retrieval specifically, augmentation is used to generate synthetic queries from passages, improving embedding model training when real query-passage pairs are scarce. The main risk is that aggressive augmentation can distort meaning and introduce noise, degrading model quality if not carefully controlled.
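
To make this concrete, here is a minimal sketch of one simple augmentation technique, random word dropout. The `word_dropout` helper and the drop probability are illustrative choices, not a prescribed method; real pipelines more often rely on paraphrasing or back-translation models.

```python
import random

def word_dropout(text, p=0.1, seed=None):
    """Illustrative augmentation: randomly drop words to add controlled noise."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else text  # never emit an empty example

original = "How do I reset my account password?"
variants = [word_dropout(original, p=0.2, seed=i) for i in range(3)]
print(variants)
```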

Decoder

A decoder converts a model's internal representation into useful output; in the case of language models, it generates text one token at a time. Decoders process text left to right, predicting each token from the ones that came before. Models like GPT are decoder-only architectures.
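
As a toy illustration of that left-to-right process, the sketch below replaces a real language model with a hypothetical bigram lookup table; only the greedy decoding loop itself reflects how decoders actually operate.

```python
# A toy stand-in for a language model: a hypothetical table that maps
# each token to the single "most likely" next token.
next_token_table = {"<start>": "the", "the": "cat", "cat": "sat", "sat": "<end>"}

def greedy_decode(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        # Predict the next token from what came before (here: just the last token).
        prediction = next_token_table.get(tokens[-1], "<end>")
        if prediction == "<end>":
            break
        tokens.append(prediction)
    return tokens[1:]

print(greedy_decode())  # ['the', 'cat', 'sat']
```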

Decoder-Only Model

A model built using only the decoder component of the transformer. Originally designed for text generation (GPT-style models), decoder-only architectures have recently been adapted for embedding tasks. They process text left to right using causal attention, which requires different techniques to produce good embeddings compared to encoder-only models. The difference between encoder and decoder models lies mainly in how they are trained and used, not in a fundamentally different architecture.

Deduplication

Identifying and removing duplicate or near-duplicate items from a dataset or index. In training data, deduplication prevents the model from overfitting to repeated examples. In search indexes, it prevents users from seeing the same content multiple times.
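
A minimal sketch of near-duplicate detection using word shingles and Jaccard similarity; the threshold and the brute-force pairwise comparison are illustrative choices (production systems typically scale this with techniques such as MinHash or locality-sensitive hashing).

```python
def shingles(text, k=3):
    """Break text into overlapping k-word shingles for near-duplicate comparison."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def dedupe(docs, threshold=0.8):
    """Keep a document only if it is not too similar to anything already kept."""
    kept = []
    for doc in docs:
        sig = shingles(doc)
        if all(jaccard(sig, shingles(k)) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",  # near-duplicate
    "an entirely different sentence about search",
]
print(dedupe(docs))  # the near-duplicate is dropped
```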

Dense Vector

A vector where most or all values are non-zero. Neural network embeddings are dense vectors where every position carries meaningful information. This contrasts with sparse vectors, where most values are zero.
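
A small illustration of the contrast, with made-up numbers:

```python
# Dense: every position holds a meaningful value (an 8-dimensional example).
dense = [0.12, -0.45, 0.88, 0.03, -0.29, 0.51, -0.07, 0.66]

# Sparse: only non-zero entries are stored, keyed by position or term id;
# every position not listed is implicitly zero.
sparse = {3: 1.7, 4091: 0.4, 20177: 2.2}
```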

Dimension

The number of individual values in a vector. A 768-dimensional embedding is a list of 768 numbers. Higher dimensions can capture more nuance but require more storage and computation. The choice of dimension is a trade-off between representational richness and practical cost.
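
As a rough illustration of the storage side of that trade-off (the corpus size here is an assumption, for illustration only):

```python
num_vectors = 10_000_000          # assumed corpus size
dims = 768
bytes_per_value = 4               # float32
total_gb = num_vectors * dims * bytes_per_value / 1024**3
print(f"{total_gb:.1f} GB")       # ~28.6 GB of raw vectors, before index overhead
```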

Dimension Truncation

Using only the first N dimensions of an embedding instead of the full output. When a model is trained with Matryoshka representation learning, the most important features are concentrated in the earlier dimensions, making truncation a practical way to reduce storage and computation costs.
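
A minimal sketch of truncation with NumPy; re-normalizing after truncation is commonly needed so that cosine or dot-product comparisons remain meaningful, and the 256-dimension target is an arbitrary example:

```python
import numpy as np

def truncate(embedding: np.ndarray, n: int) -> np.ndarray:
    """Keep the first n dimensions, then re-normalize to unit length."""
    truncated = embedding[:n]
    return truncated / np.linalg.norm(truncated)

full = np.random.rand(768).astype(np.float32)   # placeholder for a real embedding
small = truncate(full, 256)                     # 3x less storage per vector
```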

Dimensionality Reduction

Any technique that reduces the number of dimensions in an embedding while preserving as much useful information as possible. Lower-dimensional embeddings use less memory and enable faster search, at the potential cost of some accuracy.
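
PCA is one common technique; here is a sketch using scikit-learn, with random data standing in for real embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 768)        # placeholder for real embeddings
pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)       # shape: (1000, 128)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained
```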

Distance Metric

A function that quantifies how far apart two vectors are. Common options include cosine similarity, dot product, and Euclidean distance. The choice of metric affects search results and should match the metric used to train the model.
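
The three common options, sketched in NumPy. Note that cosine similarity and dot product are similarities (higher means closer), while Euclidean distance is a true distance (lower means closer):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

a = np.array([0.1, 0.9, 0.4])
b = np.array([0.2, 0.8, 0.5])
print(cosine_similarity(a, b), np.dot(a, b), euclidean_distance(a, b))
```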

Document

In search and retrieval, any unit of content treated as a discrete retrievable item, ranging from a full web page or PDF to a single paragraph. The term is used loosely: a logical document is the original source, while the indexed unit is often a smaller chunk produced by splitting the source for embedding. Understanding which level "document" refers to in a given system is important: retrieval operates on indexed chunks, while results are often traced back to their source document.

Document Embedding

An embedding that represents a full document or long passage as a single vector. This is hard because one vector must capture the meaning of potentially thousands of words. Approaches include using long-context models or combining embeddings of individual passages.
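
A minimal sketch of the combining approach, using mean pooling over chunk embeddings. Mean pooling is just one simple strategy, and the chunk count and dimensions here are placeholders:

```python
import numpy as np

def document_embedding(passage_embeddings: np.ndarray) -> np.ndarray:
    """Average the passage vectors into one document vector, then normalize."""
    doc = passage_embeddings.mean(axis=0)
    return doc / np.linalg.norm(doc)

passages = np.random.rand(12, 768)        # 12 chunk embeddings from one document
doc_vec = document_embedding(passages)    # a single 768-dimensional vector
```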

Dot Product

A mathematical operation that multiplies corresponding values of two vectors and sums the results. It measures both similarity in direction and the magnitude of the vectors. When vectors are normalized to length 1, the dot product equals cosine similarity.
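
A worked example, including the equivalence with cosine similarity once the vectors are normalized:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

dot = np.dot(a, b)                        # 3*1 + 4*2 = 11.0
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

a_unit = a / np.linalg.norm(a)            # scale to length 1
b_unit = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_unit, b_unit), cosine)   # equal once normalized
```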

Downstream Task

Any specific task an embedding model is applied to after training: retrieval, classification, clustering, or semantic similarity. A good embedding model performs well across a range of downstream tasks without separate fine-tuning for each.
