Unsupervised document clustering with Elasticsearch + Jina embeddings

A practical, reproducible approach to unsupervised document clustering with Elasticsearch and Jina embeddings.

From vector search to powerful REST APIs, Elasticsearch offers developers the most extensive search toolkit. Dive into our sample notebooks in the Elasticsearch Labs repo to try something new. You can also start your free trial or run Elasticsearch locally today.

Vector search starts with a query, but what if you don't have one?

Organizations accumulate large document collections, like support tickets, legal filings, news feeds, research papers, and need to understand what's in them before they can ask the right questions. Without labels or training data, manually reviewing thousands of documents is impractical. Traditional search doesn't help when you don't know what to search for.

This post walks through an Elasticsearch-native approach to unsupervised document clustering and temporal story tracking that addresses this discovery problem. By the end, you'll be able to trace story arcs across days, from a story's first appearance to its fade-out.

What you'll discover:

  • Why clustering embeddings (not retrieval embeddings) matter when you want topic discovery without a query.
  • How density-probed centroid classification groups documents by topic using Elasticsearch k-nearest neighbor (kNN) and batched msearch.
  • How significant_text can auto-label clusters so themes are readable without training a model.
  • How temporal story chains link daily clusters to show how themes evolve from day to day.

This post is generated from a runnable Jupyter Notebook. The inline outputs you see throughout are real results from the pipeline. Clone the companion notebook to run it yourself.

The pipeline uses ~8,500 February 2025 articles from BBC News and The Guardian as a test corpus. News is convenient because it has clear temporal behavior, but the pattern applies anywhere document discovery matters: legal review, compliance monitoring, research synthesis, customer support triage.

Stack:

  • Jina v5 clustering embeddings: Task-specific Low-Rank Adaptation (LoRA) adapters for topic grouping. Jina has joined Elastic, and its models are available natively through Elastic Inference Service (EIS).
  • Elasticsearch: Scalable kNN, significant_text labeling, and vector storage.
  • DiskBBQ: A disk-based vector index format that combines Better Binary Quantization (BBQ) with hierarchical k-means partitioning for approximate nearest neighbors (ANN) acceleration. This index partitioning is internal to vector search and separate from the density-probed clustering algorithm used in this post. bbq_disk stores quantized vectors on disk and keeps only partition metadata in heap, dramatically reducing resource requirements compared to bbq_hnsw while maintaining high recall.
  • Global clustering + daily temporal linking: Discovery and story evolution.

What you'll need:

  • An Elasticsearch deployment (Elastic Cloud, Elasticsearch Serverless, or Elastic Self-Managed 8.18+/9.0+): bbq_disk requires 8.18 or later. The optional diversify retriever section requires 9.3+ or serverless.
  • A Jina API key: The free tier includes 10 million tokens, which covers the core clustering pipeline (~4.25 million tokens). The optional retrieval-versus-clustering comparison uses a second embedding pass.
  • A Guardian API key (free).

Setup

Install required packages:
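The companion repo pins exact versions; assuming the stack described above (Elasticsearch Python client, dotenv loading, UMAP projection, and charting), a typical install looks like this. Package names are assumptions, not the notebook's pinned requirements:

```shell
pip install elasticsearch python-dotenv requests numpy pandas umap-learn matplotlib plotly
```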

Optional (only if you run scraping helpers from this repo):

Then configure API keys in a .env file at the project root:
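A minimal .env might look like the following. The variable names here are illustrative assumptions; use whatever names the notebook actually reads:

```shell
# Hypothetical variable names -- match them to what the notebook expects
JINA_API_KEY=your-jina-api-key
GUARDIAN_API_KEY=your-guardian-api-key
ELASTICSEARCH_URL=https://your-deployment.es.example.com:443
ELASTICSEARCH_API_KEY=your-elasticsearch-api-key
```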

This notebook calls load_dotenv(override=True), so local .env values take precedence.

Part 1: Discovery clustering - Why clustering embeddings?

Most vector search uses retrieval embeddings trained to match a query to relevant documents. That's perfect for search, but not for discovery. When you want to find what topics exist in a corpus without any query at all, you need embeddings that group similar documents together.

Jina v5 solves this with task-specific Low-Rank Adaptation (LoRA) adapters. LoRA adds small low-rank updates to targeted internal layers while keeping most base-model weights frozen, so the model behavior shifts toward a specific task without full retraining. The same base model produces different embeddings depending on the task parameter:

  Task              | Trained for                                   | Use case
  retrieval.passage | Query-document matching                       | Search, retrieval augmented generation (RAG)
  clustering        | Topic grouping (optimized for tight clusters) | Discovery, categorization

The clustering adapter is trained to make documents about the same topic closer in embedding space and documents about different topics further apart. The visual comparison below makes the difference concrete.

Retrieval vs. clustering: A visual comparison

To see the difference, the same 480-document sample is embedded with both task types. Clustering is performed in the original 1024-dimensional embedding space; Uniform Manifold Approximation and Projection (UMAP) is used only to project those embeddings into 2D for visualization. Because UMAP preserves local neighborhood structure, it is well suited to comparing cluster separation. Look for tighter, more separated color groups in the clustering panel below.

Retrieval embeddings (left) spread topics broadly; clustering embeddings (right) produce tighter, more separated groups from the same documents.

The clustering embeddings produce tighter, more visually distinct groups. The retrieval embeddings spread topics out more evenly, which suits search and its need for fine-grained similarity; for discovery, tight topical clusters are what you want.

This is why task="clustering" is used for the rest of this walkthrough.

Loading the dataset

The corpus combines two news sources for February 2025: BBC News and The Guardian, roughly 8,500 articles in total.

Having multiple sources helps validate that clustering finds topics rather than source-specific style.

Embedding with the clustering task

The Jina v5 API is called with task="clustering" for all documents. Embeddings are cached to disk, so subsequent runs skip the API entirely.

The API call is straightforward. The task parameter is the key difference from typical embedding usage:
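A minimal sketch of the call, using the requests library against the Jina embeddings endpoint. The model id here is an assumption (check the Jina docs for the exact v5 model name); the notebook may use Jina's SDK instead:

```python
JINA_URL = "https://api.jina.ai/v1/embeddings"

def build_embedding_request(texts, task="clustering",
                            model="jina-embeddings-v5"):
    """Build the request payload. `task` selects the LoRA adapter:
    "clustering" for topic grouping, "retrieval.passage" for search.
    The model id is an assumption; verify it against the Jina docs."""
    return {"model": model, "task": task, "input": list(texts)}

def embed(texts, api_key):
    import requests  # third-party; installed during setup
    resp = requests.post(
        JINA_URL,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        json=build_embedding_request(texts),
        timeout=60,
    )
    resp.raise_for_status()
    # One embedding per input document, in input order
    return [item["embedding"] for item in resp.json()["data"]]
```

Cache the returned vectors to disk keyed by document id, so re-runs skip the API.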

The timing below reflects a cache hit; the first run against the API takes longer, depending on corpus size.

Indexing into a single Elasticsearch index

For discovery clustering, the full month goes into one index (docs-clustering-all). Daily partitioning comes later for temporal story linking.

The index mapping uses bbq_disk for the vector field:
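A sketch of that mapping, as a Python dict for the elasticsearch client. The non-vector field names are assumptions about the document schema; the essential part is the dense_vector field with bbq_disk index options:

```python
INDEX = "docs-clustering-all"

MAPPING = {
    "mappings": {
        "properties": {
            "title":     {"type": "text"},
            "body":      {"type": "text"},
            "published": {"type": "date"},
            "source":    {"type": "keyword"},
            # 1024-dim Jina clustering embedding, stored with DiskBBQ
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "bbq_disk"},
            },
        }
    }
}

# Create the index with, e.g.:
# es.indices.create(index=INDEX, mappings=MAPPING["mappings"])
```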

A 1024-dimensional float32 vector is 4 KB. bbq_disk uses hierarchical k-means to partition vectors into small clusters, binary-quantizes them, and stores the full-precision vectors on disk for rescoring. Only partition metadata lives in heap, so memory requirements stay low even for large corpora. For workloads that can afford more heap, bbq_hnsw builds a Hierarchical Navigable Small World (HNSW) graph for faster lookups at higher resource cost.

The dense_vector field type supports multiple quantization strategies: bbq_disk and bbq_hnsw are the best fits for high-dimensional embeddings like the 1024-dim vectors used here.

Clustering: Density-probed centroid classification

Traditional clustering algorithms like HDBSCAN assume you can hold the full N×d vector matrix in memory and run repeated full-pass updates. For 8,495 documents at 1024 dimensions, that's manageable (~35 MB), but the approach doesn't scale to millions of documents without additional infrastructure.

This algorithm is conceptually similar to KMeans++ initialization with Voronoi assignment and a noise floor, but it uses Elasticsearch kNN search as the compute primitive, keeping almost all work server-side:

  1. Sample 5% of documents as density probes (random sample, minimum 50).
  2. Probe density via batched msearch kNN. Each probe fires a kNN query and records the mean similarity of its neighbors. High mean similarity = dense region of embedding space. msearch sends multiple search requests in a single HTTP call, which is critical here: Density probing generates hundreds of kNN queries, and batching them avoids per-request overhead.
  3. Select high-density seeds with diversification: Candidates above median density are sorted by density descending and greedily accepted only when their cosine similarity to every existing seed is below a separation threshold. This is the only client-side compute (~0.01s for 8k docs).
  4. Classify all docs against centroids via msearch kNN: Each seed acts as a centroid; a kNN search retrieves nearby documents above a similarity threshold. Each document is assigned to whichever centroid returned it with the highest score. Small clusters are dissolved to noise.
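Steps 2 and 3 can be sketched as follows. This is a simplified illustration under the assumptions above (index name docs-clustering-all, vector field embedding), not the notebook's exact code:

```python
import math

def density_probe_msearch_body(probe_vectors, index="docs-clustering-all", k=10):
    """Step 2: one kNN request per probe, batched into a single msearch body.
    num_candidates is kept minimal for speed during the clustering pass."""
    body = []
    for vec in probe_vectors:
        body.append({"index": index})
        body.append({
            "knn": {"field": "embedding", "query_vector": vec,
                    "k": k, "num_candidates": k},
            "_source": False, "size": k,
        })
    return body  # send with es.msearch(body=body)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def select_diverse_seeds(vectors, densities, separation=0.80):
    """Step 3 (the only client-side compute): candidates above median
    density, sorted by density descending, greedily accepted only while
    dissimilar to every already-chosen seed."""
    ranked = sorted(range(len(vectors)), key=lambda i: -densities[i])
    median = sorted(densities)[len(densities) // 2]
    seeds = []
    for i in ranked:
        if densities[i] <= median:
            break  # ranked descending, so everything after is below median too
        if all(cosine(vectors[i], vectors[j]) < separation for j in seeds):
            seeds.append(i)
    return seeds
```

Step 4 then reuses the same msearch pattern: one kNN query per seed centroid, with each document assigned to the centroid that returned it with the highest score.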

Elasticsearch handles the heavy lifting: msearch for density probes, msearch for classification, and significant_text for labeling. For this corpus (8,495 docs), the 5% density-probe sample launches 425 kNN probe queries, which msearch batches into nine HTTP calls (at batch size 50), avoiding one-request-per-probe overhead. Combined with bbq_disk ANN lookup, this keeps the clustering stage fast and scalable. The kNN queries use a minimal num_candidates value for speed during the clustering pass; production search queries should use higher num_candidates values to improve recall at the cost of latency.

Clusters have natural sizes determined by the embedding space density around each centroid, not by a hard k cap. Dense topic regions produce larger clusters; niche topics produce smaller ones.

Why not KMeans or HDBSCAN?

KMeans assumes spherical clusters and requires the full N×d matrix in memory. For corpora that fit in memory, HDBSCAN is a strong alternative. It handles arbitrary cluster shapes and has well-understood density semantics.

The density-probed centroid approach targets a different niche: corpora where you want storage, retrieval, and clustering in one system, or where scale makes client-side matrix operations impractical. It uses Elasticsearch kNN as the compute primitive, handles arbitrary cluster sizes, and keeps nearly all computation server-side.

Understanding the noise rate

The ~28% noise rate is by design, not a failure mode. Documents that don't fit any dense cluster at the configured similarity_threshold are left unassigned rather than forced into a poor match. This acts as a quality gate: Opinion columns, short articles, and one-off stories naturally resist clustering because they lack the thematic density that defines a coherent group.

The threshold is tunable: Lowering similarity_threshold produces more aggressive clustering (more documents assigned, but looser clusters), while raising it tightens clusters and increases the noise fraction. For this corpus of mixed news content, ~30% noise is a reasonable operating point. Production deployments should tune the threshold against domain-specific quality criteria.

Automatic labels with significant_text

Now each cluster needs a human-readable label. Elasticsearch's significant_text aggregation finds terms that appear unusually often in a foreground set (the cluster) compared to a background set (the full corpus).

Under the hood, it uses a statistical heuristic (JLH score by default) that balances absolute and relative frequency shifts: no machine learning, no large language model (LLM) calls. A cluster about UK politics might surface terms like starmer, labour, and downing because those terms are disproportionately common in that cluster compared to the overall news corpus.

For this global pass, labels are computed directly against docs-clustering-all, so both foreground and background are drawn from the full month. In Part 2, labeling uses the daily index pattern (docs-clustering-*), a wildcard that lets queries span all matching indices simultaneously, to give significant_text a broader background for better contrast.

A minimal query shape looks like this:
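The following sketch builds that shape as a Python dict: foreground = the cluster's documents, background = the whole index. The text field name body is an assumption; use whatever field you indexed:

```python
def label_query(cluster_doc_ids, text_field="body", num_terms=8):
    """significant_text over one cluster (foreground) vs. the index
    (background). Returns a search body for es.search(index=..., body=...)."""
    return {
        "size": 0,
        "query": {"terms": {"_id": cluster_doc_ids}},
        "aggs": {
            "labels": {
                "significant_text": {
                    "field": text_field,
                    "size": num_terms,
                    # keep near-duplicate articles from skewing the statistics
                    "filter_duplicate_text": True,
                }
            }
        },
    }
```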

significant_text also serves as a quality gate: Clusters that produce no significant terms have no distinguishing vocabulary. They're incoherent groupings that should be dissolved back to noise rather than given a misleading label.

A lightweight deterministic cleanup step removes noisy label terms (numeric tokens, generic words) and falls back to a representative headline when needed. This keeps labels Elasticsearch native while improving readability.

Visualizing the clusters

The visualizations below show what the global clustering pass discovered: a date-wise breakdown of clustered versus noise documents, a UMAP projection of the full month, and a source-mix chart confirming that clusters reflect topics rather than sources.

Daily distribution of clustered versus noise documents across February 2025.

Each colored island in the UMAP represents a cluster: a group of articles about the same topic discovered purely from embedding similarity. The gray noise points are articles that didn't fit cleanly into any cluster (often short articles, opinion pieces, or one-off stories).

The source breakdown chart confirms that clusters contain articles from both BBC News and The Guardian. The clustering is finding topics, not sources, exactly what unsupervised discovery should produce.

Exploring cluster breadth with the diversify retriever

Plain kNN returns the documents most similar to a cluster's centroid (the dense core). But real clusters cover subtopics. The diversify retriever uses Maximal Marginal Relevance (MMR) to surface documents that are relevant to the centroid but also different from each other.

The key parameter is λ (lambda):

  • λ = 1.0 → pure relevance (same as plain kNN).
  • λ = 0.0 → pure diversity (maximally spread results).
  • λ = 0.5 → balanced: relevant to the topic, but covering different angles.

Version note: The diversify retriever is available on Elastic Cloud Serverless and Self-Managed Elasticsearch 9.3+. Earlier versions can still follow the clustering and temporal linking sections; only this exploration step requires the diversify retriever.

A minimal retriever request shape looks like this:
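Here is a sketch of that shape as a Python dict, built from the parameters described in this post (type, field, query_vector, and an MMR lambda). The exact syntax may differ by version, so treat this as an approximation and consult the Elasticsearch retriever docs; embedding is this pipeline's vector field:

```python
def diversify_request(centroid, lam=0.5, k=20):
    """Approximate diversify retriever request. Parameter names and nesting
    are assumptions based on the post's description, not verified syntax."""
    return {
        "retriever": {
            "diversify": {
                "type": "mmr",             # diversification strategy
                "field": "embedding",      # vector field for inter-result similarity
                "query_vector": centroid,  # reference point for relevance scoring
                "lambda": lam,             # 1.0 = pure relevance, 0.0 = pure diversity
                "retriever": {
                    "knn": {
                        "field": "embedding",
                        "query_vector": centroid,
                        "k": k,
                        "num_candidates": 4 * k,
                    }
                },
            }
        },
        "size": k,
    }
```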

The type, field, and query_vector parameters are required at the diversify level: field tells MMR which dense_vector field to use for inter-result similarity, and query_vector provides the reference point for relevance scoring.

This lets you answer: "What does this cluster actually cover?" rather than just "What's at its center?"

The plain kNN results cluster around one angle of the topic: the documents most similar to the centroid and to each other. The diversify retriever surfaces different facets of the same cluster: subtopics, different sources, and varied perspectives.

The diversity metric confirms this quantitatively: the average pairwise similarity is lower for the diversify retriever results, meaning that the returned documents cover more ground.

This is useful for:

  • Understanding what a cluster actually covers, not just its center but also its edges.
  • Generating summaries. Diverse representative docs give an LLM better material.
  • Finding representative examples for human review or downstream labeling.
  • Quality checks. If the diverse results look incoherent, the cluster may need splitting.

Part 2: Temporal story chains

Tracking stories across days

Part 1 clustered the full month globally for topic discovery. For temporal flow, the same density-probed centroid classification runs independently per day on daily indices, and then clusters are linked across adjacent days. Note that the daily clusters are independent of the global clusters from Part 1; each day produces its own cluster assignments and labels tuned to that day's content.

The linking approach: sample-and-query

For each cluster on day A:

  1. Sample a few representative documents.
  2. Run kNN against day B's index.
  3. Count how many hits land in each day B cluster.
  4. If the hit fraction exceeds a threshold (kNN fraction ≥ 0.4), record a link.

This is fast (only a few docs per cluster are queried, not all of them) and uses Elasticsearch's native kNN, no external tools needed.
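Once day B's kNN hits are in hand (each hit carrying the target day's cluster id), the link decision reduces to a small helper. A minimal sketch, assuming cluster ids were stored on each document:

```python
from collections import Counter

def cross_day_link(hit_clusters, threshold=0.4):
    """Given the day-B cluster id of each sampled document's nearest
    neighbors, record a link if one target cluster absorbs at least
    `threshold` of the samples. Returns (target_cluster, knn_fraction)
    or None if no target is strong enough."""
    if not hit_clusters:
        return None
    target, hits = Counter(hit_clusters).most_common(1)[0]
    fraction = hits / len(hit_clusters)
    return (target, fraction) if fraction >= threshold else None
```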

A kNN fraction of 100% means every sampled document from the source cluster landed in the same target cluster, the strongest possible cross-day link. Most links above are football-related, which makes sense: Premier League coverage runs daily with high topical consistency.

The score | operator | gedlingleague | striker | season link is an example of a niche local football cluster (Gedling is a non-league club) being absorbed into the broader Premier League cluster on the next day, a natural effect of daily reclustering at different granularity.

Building story chains

A story chain is a sequence of linked clusters across consecutive days.

Individual pairwise links tell you that Monday's "UK politics" cluster connects to Tuesday's. Chains reveal the full arc: a story that starts Monday, evolves through the week, and fades by Friday.

Chains are built greedily from links with a kNN fraction ≥ 0.4, meaning that at least 40% of sampled documents from the source cluster landed in a single target cluster. Starting from the earliest cluster, the algorithm always follows the strongest outgoing link.
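The greedy walk can be sketched as follows, where links maps a (day, cluster_id) node to its candidate outgoing links (the structure is an illustrative assumption; day ordering keeps the graph acyclic, so the loop terminates):

```python
def build_chain(start, links, min_fraction=0.4):
    """Greedily follow the strongest outgoing link from each cluster.
    `links`: {(day, cluster_id): [((next_day, cluster_id), knn_fraction), ...]}
    Stops when a node has no outgoing link above `min_fraction`."""
    chain = [start]
    node = start
    while node in links:
        nxt, fraction = max(links[node], key=lambda pair: pair[1])
        if fraction < min_fraction:
            break
        chain.append(nxt)
        node = nxt
    return chain
```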

The longest chain tracks Ukraine–Russia coverage for 19 consecutive days, unsurprising given the sustained geopolitical intensity in February 2025. The second-longest follows Premier League football across 19 days of the month. Shorter chains capture award season (film/awards, six days), Six Nations rugby (10 days), and UK political leadership coverage (seven days). Each chain represents a story arc that the algorithm discovered purely from embedding similarity across daily indices.

Sankey: Visualizing story flow

A Sankey diagram is a flow visualization where link width represents connection strength. Here, each vertical band is a day, each node is a daily cluster (sized by document count), and each colored path traces one story chain across time. Link width encodes kNN overlap strength: Thicker links mean more sampled documents landed in the target cluster. Colors are consistent per chain, so a single color path from left to right reads as one story's progression.

For example, the Ukraine–Russia chain (visible as one of the longer paths) flows continuously from early February through the third week, with consistently thick links indicating strong topical continuity across days.

Temporal story chains flowing across February 2025. Each colored path is a story persisting across days; link width indicates kNN overlap strength.

What this approach delivers

This walkthrough covered a complete unsupervised document clustering pipeline built on Elasticsearch:

  1. Clustering embeddings: Jina v5's task-specific adapters produce embeddings optimized for topic grouping, not just query-document matching.
  2. Global discovery clustering: Clustering the full month in one index maximizes cross-day topical discovery.
  3. Density-probed centroid classification: Sample 5%, probe density via msearch kNN, select diverse high-density seeds, classify all docs against centroids. Elasticsearch handles the heavy compute; only seed selection runs client-side (~0.01s).
  4. significant_text labeling: Significance testing produces meaningful cluster labels without any ML model or manual annotation. Clusters that produce no significant terms are incoherent and get demoted to noise — a built-in quality gate.
  5. Temporal story linking: Daily indices and sample-and-query cross-index kNN trace how stories evolve over time.

Key takeaways:

  • The embedding task type matters: Clustering embeddings produce measurably tighter topical groups.
  • Elasticsearch can serve as both the storage layer and the clustering engine via kNN search.
  • Density-probed centroid classification keeps nearly all compute server-side and produces clusters with natural sizes determined by embedding space density.
  • significant_text is fast, interpretable, and effective for both auto-labeling and quality gating.

When this approach is useful:

  • You have timestamped text and want topic discovery without labeled training data.
  • You want one stack for storage, vector search, labeling, and temporal linkage.

Extensions to explore:

  • Multi-period clustering (weekly, monthly rollups).
  • Real-time ingestion with incremental cluster assignment.
  • LLM-generated cluster summaries using the significant_text terms as seeds.
  • At larger scale, sampled KMeans centroids can serve as warm-start seeds for density-based clustering, reducing the probe phase cost.

Try it yourself

Swap in your own timestamped document corpus; any collection of text with dates works with this pipeline. The full notebook and supporting code are available in the companion repository.
