Glossary

This glossary describes essential terms and concepts to help you understand Elasticsearch and its related technologies.

Candidate Set

The initial set of documents or passages returned by the first-stage retrieval for the reranker to evaluate. A larger candidate set gives the reranker more to work with but takes longer to process. Typical sets range from 50 to 1,000 items.

Checkpoint

A saved snapshot of a model's weights at a specific point during training. Checkpoints let researchers resume training, compare different stages, or release intermediate versions. When a model is published, the release is typically a specific checkpoint.

Chunking

Splitting a long document into smaller segments before embedding. Embedding models have limited context windows, and shorter passages tend to produce more focused embeddings. Chunking is a standard preprocessing step. Chunk size, overlap, and where you place the boundaries all affect retrieval quality.
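A minimal, word-based sketch of fixed-size chunking with overlap; the `chunk_size` and `overlap` values are illustrative, and production chunkers usually split on tokens or sentence boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                        # last chunk reached the end of the text
    return chunks
```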

Classification Head

A small neural network layer added on top of an embedding model to perform classification. The embedding model produces a vector; the classification head maps it to a category label.
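A minimal PyTorch sketch; the embedding dimension (768) and number of classes (3) are placeholder values, and the random vector stands in for a real embedding:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """A linear layer that maps an embedding to class logits."""
    def __init__(self, embedding_dim: int = 768, num_classes: int = 3):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.linear(embedding)    # logits; apply softmax for probabilities

head = ClassificationHead()
embedding = torch.randn(1, 768)          # stand-in for an embedding model's output
predicted_class = head(embedding).argmax(dim=-1)
```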

CLIP (Contrastive Language-Image Pre-training)

A multimodal model developed by OpenAI that aligns text and image representations in a shared embedding space. CLIP was trained on a large dataset of image-text pairs using contrastive learning. It became a foundational architecture for multimodal search and influenced many subsequent models.
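As one illustration, the sentence-transformers library publishes a CLIP checkpoint named `clip-ViT-B-32` that embeds text and images into the same space; the image path below is hypothetical:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Text and images land in the same embedding space, so they can be compared.
image_emb = model.encode(Image.open("cat.jpg"))   # illustrative file path
text_emb = model.encode("a photo of a cat")

print(util.cos_sim(image_emb, text_emb))
```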

CLS Pooling

A pooling method that uses the output of the special [CLS] token (typically the first token) as the embedding for the entire input. This relies on the model learning to aggregate all relevant information into that single position. It was the original approach used in BERT-based models.
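A sketch with Hugging Face transformers, using `bert-base-uncased` as an example encoder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("CLS pooling keeps one vector per input.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# CLS pooling: take the hidden state of the first ([CLS]) token.
cls_embedding = outputs.last_hidden_state[:, 0]   # shape: (1, hidden_size)
```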

Clustering

Grouping items by proximity in embedding space using unsupervised methods, meaning no predefined categories are required. Common algorithms include k-means, which requires specifying the number of clusters in advance, and HDBSCAN, which discovers the number of clusters automatically and handles noise. Clustering is useful for discovering latent structure in data: topic groups in document collections, customer segments, or patterns in logs. Results depend heavily on embedding quality and algorithm choice.
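A minimal sketch with scikit-learn; the random matrix stands in for real document embeddings, and the cluster count is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in embeddings; in practice these come from an embedding model.
embeddings = np.random.rand(100, 384)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)   # cluster id per document
```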

Code Embedding

A dense vector representation of source code that captures functional behavior, syntax, and structure rather than surface-level token similarity. This enables retrieval of functionally similar code regardless of variable names or even programming language, documentation-to-code matching, and code search within large repositories. Models like CodeBERT and UniXcoder are specifically designed for code representation, while several general embedding models also handle code effectively.

ColBERT (Contextualized Late Interaction over BERT)

A retrieval model that represents queries and documents as sets of token-level vectors rather than single vectors. Matching is performed by comparing each query token against all document tokens and taking the best matches. This multi-vector approach is more accurate than single-vector models and more efficient than cross-encoders.
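A sketch of the late-interaction ("MaxSim") scoring step in NumPy, with random vectors standing in for token embeddings; ColBERT's real vectors come from a trained BERT encoder:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """For each query token vector, take its best (max) similarity to any
    document token vector, then sum those maxima into one relevance score."""
    # Normalize rows so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                        # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())

query_vecs = np.random.rand(5, 128)       # 5 query tokens, 128-dim vectors
doc_vecs = np.random.rand(80, 128)        # 80 document tokens
print(maxsim_score(query_vecs, doc_vecs))
```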

Context Length / Context Window

The maximum number of tokens a model can process in a single input. For embedding models, this determines the longest text you can embed in one pass. If the input exceeds the context length, it must be truncated or split into smaller pieces. Models with longer context windows can handle full documents without chunking.
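A sketch of checking an input against a model's context window with a Hugging Face tokenizer; `bert-base-uncased` has a 512-token limit:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # 512-token window

text = "a long document " * 500
num_tokens = len(tokenizer.encode(text))
if num_tokens > tokenizer.model_max_length:
    # Too long for one pass: truncate here, or split it (see Chunking above).
    token_ids = tokenizer.encode(text, truncation=True,
                                 max_length=tokenizer.model_max_length)
```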

Contrastive Learning

A training approach where the model learns by comparing pairs or groups of examples. It pulls similar items (positive pairs) closer together in embedding space and pushes dissimilar items (negative pairs) further apart. This is the dominant training paradigm for modern embedding models.

Contrastive Loss

A loss function that penalizes the model when similar items are far apart, or dissimilar items are close together. It directly shapes the geometry of the embedding space by encouraging meaningful clustering of related content.
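One widely used form is the InfoNCE loss with in-batch negatives, where every other example in the batch serves as a negative for a given query. A minimal PyTorch sketch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs, positive_embs, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the matching row;
    all other rows in the batch act as negatives."""
    q = F.normalize(query_embs, dim=1)
    p = F.normalize(positive_embs, dim=1)
    logits = q @ p.T / temperature         # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0))      # diagonal entries are positive pairs
    return F.cross_entropy(logits, targets)

loss = info_nce_loss(torch.randn(8, 384), torch.randn(8, 384))
```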

Corpus

The complete collection of content that a system operates over, either for retrieval or training. In search, the corpus is a fixed, bounded, indexed dataset, analogous to a knowledge base but emphasizing scope and completeness rather than organizational structure. In machine learning, the training corpus defines the linguistic and semantic patterns a model can learn, and determines the boundaries of what the system knows.

Cosine Similarity

A way of comparing two vectors by measuring the angle between them, ignoring their length. The result ranges from -1 (opposite) through 0 (unrelated) to 1 (identical direction). It is the most commonly used similarity metric for text embeddings because it measures the direction of meaning rather than magnitude.
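The computation itself is one line: the dot product of the two vectors divided by the product of their norms.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product normalized by vector lengths: measures angle, not magnitude.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```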

Cross-Attention

An attention mechanism in which queries from one sequence attend to keys and values from another, relating the two inputs. It is central to encoder-decoder architectures, where the decoder attends to the encoder output while generating a response, and to cross-encoder rerankers, where a query and document are processed jointly to produce a relevance score.
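A minimal sketch of the attention computation with the two sequences kept separate; the learned query/key/value projection matrices are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def cross_attention(query_seq, context_seq):
    """Single-head cross-attention: queries come from one sequence,
    keys and values from another (projections omitted for brevity)."""
    d = query_seq.size(-1)
    scores = query_seq @ context_seq.transpose(-2, -1) / d ** 0.5
    weights = F.softmax(scores, dim=-1)    # attention over context positions
    return weights @ context_seq           # values taken from the context

decoder_states = torch.randn(1, 4, 64)     # 4 decoder positions
encoder_states = torch.randn(1, 10, 64)    # 10 encoder positions
out = cross_attention(decoder_states, encoder_states)  # shape: (1, 4, 64)
```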

Cross-Encoder

An architecture where the query and document are processed together as a single combined input, allowing deep interaction between them. Cross-encoders produce a relevance score rather than separate embeddings. They are more accurate than bi-encoders but much slower, since every query-document pair must be processed from scratch.

Cross-Encoder Reranker

A reranker that concatenates the query and each candidate document into a single input, allowing the model to capture deep interactions between them through cross-attention. This produces highly accurate relevance scores: for instance, recognizing that 'How tall is the Eiffel Tower?' is answered by 'The iron lattice structure stands 330 meters' despite no word overlap. Because document representations cannot be precomputed, cross-encoders are too slow to search the full corpus and are applied only to the small candidate set returned by first-stage retrieval.
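A sketch with the sentence-transformers `CrossEncoder` class, reusing the Eiffel Tower example; `cross-encoder/ms-marco-MiniLM-L-6-v2` is one publicly available checkpoint:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How tall is the Eiffel Tower?"
candidates = [
    "The iron lattice structure stands 330 meters.",
    "Paris is the capital of France.",
]
# Each (query, document) pair is scored jointly, not from cached embeddings.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```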

Cross-Lingual Retrieval

A retrieval pattern where a query in one language returns relevant results in another, without requiring translation at query time. It depends on embedding models trained to align representations across languages in a shared semantic space, such as mE5, LaBSE, or Jina Embeddings. Performance varies across language pairs: high-resource pairs like English and German perform well, whereas low-resource languages remain challenging for most models.
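A hedged sketch using `intfloat/multilingual-e5-base`, one of the mE5 checkpoints mentioned above, which expects `query:` / `passage:` prefixes on its inputs:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# English query, German passage: both land in the same semantic space.
query_emb = model.encode("query: How tall is the Eiffel Tower?")
doc_emb = model.encode("passage: Der Eiffelturm ist 330 Meter hoch.")

print(util.cos_sim(query_emb, doc_emb))
```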

Cross-Modal Retrieval

A retrieval pattern where a query in one modality, such as text, retrieves relevant content in another, such as images, audio, or video. It depends on multimodal embedding models that align different content types in a shared vector space, with retrieval quality determined by how well the model aligns representations across modalities during training. Practical applications include image search from text descriptions, product discovery by photo, and document retrieval from visual content.

Ready to build state-of-the-art search experiences?

Sufficiently advanced search isn't achieved with the efforts of one person alone. Elasticsearch is powered by data scientists, ML ops teams, engineers, and many others who are just as passionate about search as you are. Connect and collaborate to build the magical search experiences that get you the results you want.

Try it out yourself