To improve any system you need to be able to measure how well it is doing. In the context of search, BEIR (or, equivalently, the Retrieval section of the MTEB leaderboard) is considered the “holy grail” of the information retrieval community, and that is no surprise: it is a very well-structured benchmark with varied datasets covering a range of retrieval tasks and domains.
It provides a single statistic, nDCG@10, which captures how well a system places the most relevant documents for each query in the top results it returns. For a search system that a human interacts with, the relevance of the top results is critical. However, there are many nuances to evaluating search that a single summary statistic misses.
Each benchmark dataset has three artefacts: the corpus (the collection of documents), the queries, and the relevance judgments (the qrels file). Relevance judgments are provided as a score which is zero or greater; non-zero scores indicate that the document is somewhat related to the query.
Dataset | Corpus size | #Queries in the test set | #qrels positively labeled | #qrels equal to zero | #duplicates in the corpus |
---|---|---|---|---|---|
Arguana | 8,674 | 1,406 | 1,406 | 0 | 96 |
Climate-FEVER | 5,416,593 | 1,535 | 4,681 | 0 | 0 |
DBPedia | 4,635,922 | 400 | 15,286 | 28,229 | 0 |
FEVER | 5,416,568 | 6,666 | 7,937 | 0 | 0 |
FiQA-2018 | 57,638 | 648 | 1,706 | 0 | 0 |
HotpotQA | 5,233,329 | 7,405 | 14,810 | 0 | 0 |
Natural Questions | 2,681,468 | 3,452 | 4,021 | 0 | 16,781 |
NFCorpus | 3,633 | 323 | 12,334 | 0 | 80 |
Quora | 522,931 | 10,000 | 15,675 | 0 | 1,092 |
SCIDOCS | 25,657 | 1,000 | 4,928 | 25,000 | 2 |
Scifact | 5,183 | 300 | 339 | 0 | 0 |
Touche2020 | 382,545 | 49 | 932 | 1,982 | 5,357 |
TREC-COVID | 171,332 | 50 | 24,763 | 41,663 | 0 |
MSMARCO | 8,841,823 | 6,980 | 7,437 | 0 | 324 |
CQADupstack (sum) | 457,199 | 13,145 | 23,703 | 0 | 0 |
Table 1 presents some statistics for the datasets that comprise the BEIR benchmark, such as the number of documents in the corpus, the number of queries in the test dataset, and the number of positive/negative (query, doc) pairs in the qrels file. From a quick look at the data we can immediately infer the following:
- Only four datasets (DBPedia, SCIDOCS, Touche2020 and TREC-COVID) contain negative judgments in the qrels file, i.e. zero scores, which would explicitly denote documents as irrelevant to the given query.
- The ratio of judgments per query (#qrels / #queries) varies from 1.0 in the case of ArguAna to 493.5 (TREC-COVID), but with a value < 5 in the majority of cases.
- In ArguAna we have identified 96 cases of duplicate doc pairs with only one doc per pair being marked as relevant to a query. By “expanding” the initial qrels list to also include the duplicates we observed a relative increase of ~1% in the nDCG@10 score on average.

```json
{
  "_id": "test-economy-epiasghbf-pro02b",
  "title": "economic policy international africa society gender house believes feminisation",
  "text": "Again employment needs to be contextualised with …",
  "metadata": {}
}
{
  "_id": "test-society-epiasghbf-pro02b",
  "title": "economic policy international africa society gender house believes feminisation",
  "text": "Again employment needs to be contextualised with …",
  "metadata": {}
}
```
Example of a duplicate pair in ArguAna. In the qrels file only the first appears as relevant (as a counter-argument) to the query “test-economy-epiasghbf-pro02a”.
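Duplicate pairs like the one above can be found with a simple content hash over the corpus. Below is a minimal sketch; the function name and the toy corpus are ours, not part of the BEIR tooling:

```python
import hashlib
from collections import defaultdict

def find_duplicate_docs(corpus):
    """Group documents whose (title, text) content is identical.

    corpus: dict mapping doc _id -> {"title": ..., "text": ...}
    Returns the groups of ids with more than one member.
    """
    groups = defaultdict(list)
    for doc_id, doc in corpus.items():
        key = hashlib.sha256(
            (doc.get("title", "") + "\n" + doc.get("text", "")).encode("utf-8")
        ).hexdigest()
        groups[key].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Toy corpus mirroring the ArguAna duplicate pair shown above.
corpus = {
    "test-economy-epiasghbf-pro02b": {
        "title": "economic policy international africa society gender house believes feminisation",
        "text": "Again employment needs to be contextualised with ...",
    },
    "test-society-epiasghbf-pro02b": {
        "title": "economic policy international africa society gender house believes feminisation",
        "text": "Again employment needs to be contextualised with ...",
    },
    "unique-doc": {"title": "something else", "text": "unrelated content"},
}
print(find_duplicate_docs(corpus))
```

Expanding the qrels with such groups is then just a matter of copying each judgment to every member of the group.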
When comparing models on the MTEB leaderboard it is tempting to focus on average retrieval quality. This is a good proxy for the overall quality of the model, but it doesn't necessarily tell you how it will perform for you. Since results are reported per dataset, it is worth understanding how closely the different datasets relate to your search task and re-scoring models using only the most relevant ones. If you want to dig deeper, you can additionally check for topic overlap with the various dataset corpora. Stratifying quality measures by topic gives a much finer-grained assessment of each model's specific strengths and weaknesses.
One important note here is that when a document is not marked in the qrels file then by default it is considered irrelevant to the query. We dive a little further into this area and collect some evidence to shed more light on the following question: “How often is an evaluator presented with (query, document) pairs for which there is no ground truth information?”
The reason this is important is that when only shallow markup is available (and thus not every relevant document is labeled as such), one information retrieval system can be judged worse than another just because it “chooses” to surface different relevant (but unmarked) documents. This is a common gotcha in creating high quality evaluation sets, particularly for large datasets. To be feasible, manual labelling usually focuses on the top results returned by the current system, so it potentially misses relevant documents in its blind spots. Therefore, it is usually preferable to focus more resources on fuller markup of fewer queries than on broad, shallow markup.
To initiate our analysis we implement the following scenario (see the notebook): for each dataset we retrieve candidate documents, either with BM25 alone or by reranking the BM25 candidates, and measure the percentage of the top results that are present in the qrels file. The list of reranking models we used is the following: Cohere's rerank-english-v2.0 and rerank-english-v3.0, BGE-base, mxbai-rerank-xsmall-v1, and MiniLM-L-6-v2.
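The per-dataset percentages reported below amount to a simple “judged at k” computation, which can be sketched as follows (the function and variable names are ours; the notebook contains the full implementation):

```python
def qrels_coverage_at_k(ranked_doc_ids, qrels_for_query, k=10):
    """Fraction of the top-k results that have *any* judgment
    (positive or zero) in the qrels file for this query."""
    top_k = ranked_doc_ids[:k]
    judged = sum(1 for doc_id in top_k if doc_id in qrels_for_query)
    return judged / len(top_k)

# Toy example: 10 ranked results, 3 of which are judged in the qrels.
qrels = {"d1": 1, "d7": 0, "d9": 2}
ranking = [f"d{i}" for i in range(1, 11)]
print(qrels_coverage_at_k(ranking, qrels))  # 0.3
```

Averaging this quantity over all test queries of a dataset gives a single coverage figure per (dataset, ranker) pair.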
Dataset | BM25 (retrieval) (%) | Cohere Rerank v2 (%) | Cohere Rerank v3 (%) | BGE-base (%) | mxbai-rerank-xsmall-v1 (%) | MiniLM-L-6-v2 (%) |
---|---|---|---|---|---|---|
Arguana | 7.54 | 4.87 | 7.87 | 4.52 | 4.53 | 6.84 |
Climate-FEVER | 5.75 | 6.24 | 8.15 | 9.36 | 7.79 | 7.58 |
DBPedia | 61.18 | 60.78 | 64.15 | 63.9 | 63.5 | 67.62 |
FEVER | 8.89 | 9.97 | 10.08 | 10.19 | 9.88 | 9.88 |
FiQa-2018 | 7.02 | 11.02 | 10.77 | 8.43 | 9.1 | 9.44 |
HotpotQA | 12.59 | 14.5 | 14.76 | 15.1 | 14.02 | 14.42 |
Natural Questions | 5.94 | 8.84 | 8.71 | 8.37 | 8.14 | 8.34 |
NFCorpus | 31.67 | 32.9 | 33.91 | 30.63 | 32.77 | 32.45 |
Quora | 12.2 | 10.46 | 13.04 | 11.26 | 12.58 | 12.78 |
SCIDOCS | 8.62 | 9.41 | 9.71 | 8.04 | 8.79 | 8.52 |
Scifact | 9.07 | 9.57 | 9.77 | 9.3 | 9.1 | 9.17 |
Touche2020 | 38.78 | 30.41 | 32.24 | 33.06 | 37.96 | 33.67 |
TREC-COVID | 92.4 | 98.4 | 98.2 | 93.8 | 99.6 | 97.4 |
MSMARCO | 3.97 | 6.00 | 6.03 | 6.07 | 5.47 | 6.11 |
CQADupstack (avg.) | 5.47 | 6.32 | 6.87 | 5.89 | 6.22 | 6.16 |
From Table 2, with the exception of TREC-COVID (>90% coverage), DBPedia (~65%), Touche2020 and NFCorpus (~35%), we see that the majority of the datasets have a labeling rate between 5% and a little more than 10% after retrieval or reranking. This doesn't mean that all these unmarked documents are relevant, but there might be a subset of them, especially those placed in the top positions, that could be positive.
With the arrival of general purpose instruction tuned language models, we have a new powerful tool which can potentially automate judging relevance. These methods are typically far too computationally expensive to be used online for search, but here we are concerned with offline evaluation. In the following we use them to explore the evidence that some of the BEIR datasets suffer from shallow markup.
In order to further investigate this hypothesis we decided to focus on MSMARCO and selected a subset of 100 queries along with the top-5 reranked (with Cohere v2) documents which are currently not marked as relevant. We followed two different paths of evaluation: first, we used a carefully tuned prompt (more on this in a later post) to prime the recently released Phi-3-mini-4k model to predict the relevance (or not) of a document to the query. In parallel, these cases were also manually labeled in order to assess the agreement rate between the LLM output and human judgment. Overall, we can draw the following two conclusions$\dag$: a majority of these unmarked (query, document) pairs turned out to be relevant after inspection (288 of the 500 pairs; see the update below), and comparing the two sets of labels suggests that, with careful prompting, an LLM can approximate human relevance judgments.
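We don't reproduce the tuned prompt here (it is the subject of the next post), but the general shape of an LLM relevance judge is easy to sketch. Everything below, including the wording of the prompt, is illustrative only and not the prompt we actually used:

```python
def build_relevance_prompt(query, document):
    """Hypothetical zero-shot prompt for a small instruction-tuned model
    (e.g. Phi-3-mini-4k). The carefully tuned prompt from our experiments
    is deliberately not reproduced here."""
    return (
        "You are a search relevance judge.\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Does the document answer or directly address the query? "
        "Reply with exactly one word: relevant or irrelevant."
    )

prompt = build_relevance_prompt(
    "do bigger tires affect gas mileage",
    "Tire Size and Width Influences Gas Mileage. ...",
)
print(prompt.splitlines()[1])  # "Query: do bigger tires affect gas mileage"
```

The model's one-word completion can then be compared against the human label for each (query, document) pair to compute an agreement rate.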
Here are some examples drawn from the MSMARCO/dev dataset, each containing the query, the annotated positive document (from the qrels file), and a false negative document due to incomplete markup:
Example 1:
```json
{
  "query": {
    "_id": 155234,
    "text": "do bigger tires affect gas mileage"
  },
  "positive_doc": {
    "_id": 502713,
    "text": "Tire Width versus Gas Mileage. Tire width is one of the only tire size factors that can influence gas mileage in a positive way. For example, a narrow tire will have less wind resistance, rolling resistance, and weight; thus increasing gas mileage."
  },
  "negative_doc": {
    "_id": 7073658,
    "text": "Tire Size and Width Influences Gas Mileage. There are two things to consider when thinking about tires and their effect on gas mileage; one is wind resistance, and the other is rolling resistance. When a car is driving at higher speeds, it experiences higher wind resistance; this means lower fuel economy."
  }
}
```
Example 2:
```json
{
  "query": {
    "_id": 300674,
    "text": "how many years did william bradford serve as governor of plymouth colony?"
  },
  "positive_doc": {
    "_id": 7067032,
    "text": "http://en.wikipedia.org/wiki/William_Bradford_(Plymouth_Colony_governor) William Bradford (c.1590 – 1657) was an English Separatist leader in Leiden, Holland and in Plymouth Colony was a signatory to the Mayflower Compact. He served as Plymouth Colony Governor five times covering about thirty years between 1621 and 1657."
  },
  "negative_doc": {
    "_id": 2495763,
    "text": "William Bradford was the governor of Plymouth Colony for 30 years. The colony was founded by people called Puritans. They were some of the first people from England to settle in what is now the United States. Bradford helped make Plymouth the first lasting colony in New England."
  }
}
```
Manually evaluating specific queries like this is a generally useful technique for understanding search quality that complements quantitative measures like nDCG@10. If you have a representative set of queries you always run when you make changes to search, it gives you important qualitative information about how performance changes, which is invisible in the statistics. For example, it gives you much more insight into the false results your search returns: it can help you spot obvious howlers in retrieved results, classes of related mistakes such as misinterpreting domain-specific terminology, and so on.
Our result is in agreement with relevant research around MSMARCO
evaluation. For example, Arabzadeh et al. follow a similar procedure where they employ crowdsourced workers to make preference judgments: among other things, they show that in many cases the documents returned by the reranking modules are preferred compared to the documents in the MSMARCO qrels
file. Another piece of evidence comes from the authors of the RocketQA reranker who report that more than 70% of the reranked documents were found relevant after manual inspection.
$\dag$ Update - September 9th: After a careful re-evaluation of the dataset we identified 15 more cases of relevant documents, increasing their total number from 273 to 288.
The pursuit of better ground truth is never-ending, as it is crucial for benchmarking and model comparison. LLMs can assist in some evaluation areas if used with caution and tuned with proper instructions.
More generally, given that benchmarks will never be perfect, it might be preferable to switch from pure score comparison to more robust techniques capturing statistically significant differences. The work of Arabzadeh et al. provides a nice example of this: based on their findings they build 95% confidence intervals indicating significant (or not) differences between the various runs. In the accompanying notebook we provide an implementation of confidence intervals using bootstrapping.
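A minimal version of such a bootstrap, operating on per-query metric differences between two systems, might look like the following (function name and toy numbers are ours; the notebook has the full implementation):

```python
import random

def bootstrap_ci(per_query_deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-query
    difference in a metric (e.g. nDCG@10) between two systems. If the
    interval excludes zero, the difference is significant at roughly
    the 1 - alpha level."""
    rng = random.Random(seed)
    n = len(per_query_deltas)
    means = sorted(
        sum(rng.choices(per_query_deltas, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy example: system B beats system A on most queries.
deltas = [0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.06, 0.02, 0.01, 0.03]
low, high = bootstrap_ci(deltas)
print(low, high)
```

With real data you would use one delta per test query; the width of the interval then directly reflects how many queries the benchmark provides.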
From the end-user perspective it's useful to think about task alignment when reading benchmark results. For example, an AI engineer who builds a RAG pipeline and knows that the most typical use case involves assembling multiple pieces of information from different sources would find it more meaningful to assess the performance of their retrieval model on multi-hop QA datasets like HotpotQA, rather than on the global average across the whole BEIR benchmark.
In the next blog post we will dive deeper into the use of Phi-3 as an LLM judge and the journey of tuning it to predict relevance.
In 8.13 we introduced scalar quantization to Elasticsearch. By using this feature an end-user can provide float vectors that are internally indexed as byte vectors, while the float vectors are retained in the index for optional re-scoring. This means they can reduce their index memory requirement, which is its dominant cost, by a factor of four. At the moment this is an opt-in feature, but we believe it constitutes a better trade-off than indexing vectors as floats. In 8.14 we will switch to make this our default. However, before doing this we wanted a systematic evaluation of the quality impact.
The multilingual E5-small is a small, high-quality multilingual passage embedding model that we offer out-of-the-box in Elasticsearch. It has two versions: one cross-platform version which runs on any hardware, and one version which is optimized for CPU inference in the Elastic Stack (see here). E5 represents a challenging case for automatic quantization because the vectors it produces have low angular variation and are relatively low-dimensional compared to state-of-the-art models. If we can enable int8 quantization for this model with little to no damage, we can be confident that it will work reliably.
The purpose of this experimentation is to estimate the effects of scalar-quantized kNN search as described here across a broad range of retrieval tasks using this model. More specifically, our aim is to assess the performance degradation (if any) by switching from a full-precision to a quantized index.
For the evaluation we relied upon BEIR, and for each dataset that we considered we built a full-precision and an int8-quantized index using the default hyperparameters (m: 16, ef_construction: 100). First, we experimented with the quantized (weights only) version of the multilingual E5-small model provided by Elastic here, with Table 1 presenting a summary of the nDCG@10 scores (k: 10, num_candidates: 100):
Dataset | Full precision | Int8 quantization | Absolute difference | Relative difference |
---|---|---|---|---|
Arguana | 0.37 | 0.362 | -0.008 | -2.16% |
FiQA-2018 | 0.309 | 0.304 | -0.005 | -1.62% |
NFCorpus | 0.302 | 0.297 | -0.005 | -1.66% |
Quora | 0.876 | 0.875 | -0.001 | -0.11% |
SCIDOCS | 0.135 | 0.132 | -0.003 | -2.22% |
Scifact | 0.649 | 0.644 | -0.005 | -0.77% |
TREC-COVID | 0.683 | 0.672 | -0.011 | -1.61% |
Average | | | -0.005 | -1.05% |
Overall, it seems that there is a slight relative decrease of 1.05% on average.
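For reference, the only difference between the two indices in this comparison is the `index_options` type of the `dense_vector` field. A sketch of the mapping (the field name and the helper function are ours):

```python
# Sketch of the two index mappings compared above (field name is ours).
# "int8_hnsw" enables scalar quantization in Elasticsearch 8.13+, while
# "hnsw" keeps a full-precision float index.
def dense_vector_mapping(quantized: bool, dims: int = 384):
    return {
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "similarity": "cosine",
                    "index": True,
                    "index_options": {
                        "type": "int8_hnsw" if quantized else "hnsw",
                        "m": 16,
                        "ef_construction": 100,
                    },
                }
            }
        }
    }

opts = dense_vector_mapping(quantized=True)["mappings"]["properties"]["embedding"]["index_options"]
print(opts["type"])  # int8_hnsw
```

The m and ef_construction values match the default hyperparameters used in the evaluation.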
Next, we considered repeating the same evaluation process using the unquantized version of multilingual E5-small (see model card here) and Table 2 shows the respective results.
Dataset | Full precision | Int8 quantization | Absolute difference | Relative difference |
---|---|---|---|---|
Arguana | 0.384 | 0.379 | -0.005 | -1.3% |
Climate-FEVER | 0.214 | 0.222 | +0.008 | +3.74% |
FEVER | 0.718 | 0.715 | -0.003 | -0.42% |
FiQA-2018 | 0.328 | 0.324 | -0.004 | -1.22% |
NFCorpus | 0.31 | 0.306 | -0.004 | -1.29% |
NQ | 0.548 | 0.537 | -0.011 | -2.01% |
Quora | 0.882 | 0.881 | -0.001 | -0.11% |
Robust04 | 0.418 | 0.415 | -0.003 | -0.72% |
SCIDOCS | 0.134 | 0.132 | -0.002 | -1.49% |
Scifact | 0.67 | 0.666 | -0.004 | -0.6% |
TREC-COVID | 0.709 | 0.693 | -0.016 | -2.26% |
Average | | | -0.004 | -0.83% |
Again, we observe a slight relative decrease in performance, equal to 0.83% on average. Finally, we repeated the exercise for multilingual E5-base and the performance decrease was even smaller (0.59%).
But this is not the whole story: The increased efficiency of the quantized HNSW indices and the fact that the original float vectors are still retained in the index allows us to recover a significant portion of the lost performance through rescoring. More specifically, we can retrieve a larger pool of candidates through approximate kNN search in the quantized index, which is quite fast, and then compute the similarity function on the original float vectors and re-score accordingly.
As a proof of concept, we consider the NQ dataset which exhibited a large performance decrease (2.01%) with multilingual E5-small. By setting k=15, num_candidates=100 and window_size=10 (as we are interested in nDCG@10) we get an improved score of 0.539, recovering about 20% of the performance. If we further increase the num_candidates parameter to 200 then we get a score that matches the performance of the full-precision index, but with faster response times. The same setup on ArguAna leads to an increase from 0.379 to 0.382, limiting the relative performance drop from 1.3% to only 0.52%.
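A request along these lines can be expressed with the standard `knn` section plus a `rescore` block that recomputes similarity on the retained float vectors. The field name and the exact script below are ours, shown only as a sketch:

```python
# Sketch of a kNN query that oversamples from the quantized index and
# re-scores the top window with exact (float) cosine similarity.
# Field name is ours; query_vector is truncated for brevity.
def knn_with_rescore(query_vector, k=15, num_candidates=100, window_size=10):
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": "cosineSimilarity(params.qv, 'embedding') + 1.0",
                            "params": {"qv": query_vector},
                        },
                    }
                },
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            },
        },
    }

body = knn_with_rescore([0.1, 0.2, 0.3])
print(body["rescore"]["window_size"])  # 10
```

Setting query_weight to 0 means the final order within the rescore window is determined purely by the exact float similarity.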
The results of our evaluation suggest that scalar quantization can be used to reduce the memory footprint of vector embeddings in Elasticsearch without significant loss in retrieval performance. The performance decrease is more pronounced for smaller vectors (multilingual E5-small produces vectors of size equal to 384 while E5-base gives 768-dimensional embeddings), but this can be mitigated through rescoring. We are confident that scalar quantization will be beneficial for most users and we plan to make it the default in 8.14.
We talked before about the scalar quantization work we've been doing in Lucene. In this two-part blog we will dig a little deeper into how we're optimizing scalar quantization for the vector database use case. This has allowed us to unlock very nice performance gains for int4 quantization in Lucene, as we'll discuss in the second part. In the first part we're going to dive into the details of how we did this. Feel free to jump ahead to learn about what this will actually mean for you. Otherwise, buckle your seatbelts, because Kansas is going bye-bye!
First of all, a quick refresher on scalar quantization.
Scalar quantization was introduced as a mechanism for accelerating inference. To date it has been mainly studied in that setting. For that reason, the main considerations were the accuracy of the model output and the performance gains that come from reduced memory pressure and accelerated matrix multiplication (GEMM) operations. Vector retrieval has some slightly different characteristics which we can exploit to improve quantization accuracy for a given compression ratio.
The basic idea of scalar quantization is to truncate and scale each floating point component of a vector (or tensor) so it can be represented by an integer. Formally, if you use $b$ bits to represent a vector component $x$ as an integer in the interval $[0,2^b-1]$ you transform it as follows
x \mapsto \left\lfloor \frac{(2^b-1)(\text{clamp}(x,a,b)-a)}{b-a} \right\rceil
where $\text{clamp}(\cdot,a,b)$ denotes $\min(\max(\cdot, a),b)$ and $\lfloor\cdot\rceil$ denotes round to the nearest integer. People typically choose $a$ and $b$ based on percentiles of the distribution. We will discuss a better approach later.
If you use int4, or 4 bit, quantization then each component is some integer in the interval $[0,15]$; that is, each component takes one of only 16 distinct values!
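As a concrete illustration, the transformation above is a few lines of NumPy (assuming scalar interval endpoints a < b):

```python
import numpy as np

def scalar_quantize(x, a, b, bits=4):
    """Clamp each component of x to [a, b], then map it to an integer
    in [0, 2^bits - 1] by shifting, scaling and rounding to nearest."""
    steps = (1 << bits) - 1
    clamped = np.clip(x, a, b)
    return np.rint(steps * (clamped - a) / (b - a)).astype(np.int32)

x = np.array([-0.3, 0.0, 0.4, 1.2])
q = scalar_quantize(x, a=0.0, b=1.0, bits=4)
print(q)  # components map to 0, 0, 6, 15
```

Note how the two out-of-interval components are clamped to the interval endpoints before scaling.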
In this part, we are going to describe in detail two specific novelties we have introduced: 1. a first-order correction to the quantized dot product that improves its accuracy, and 2. a procedure for optimizing the truncation interval $[a,b]$ for the vector database use case.
Just to pause on point 1 for a second. What we will show is that we can continue to compute the dot product directly using integer arithmetic. At the same time, we can compute an additive correction that allows us to improve its accuracy. So we can improve retrieval quality without losing the opportunity of using extremely optimized implementations of the integer dot product. This translates to a clear cut win in terms of retrieval quality as a function of performance.
Most embedding models use either cosine or dot product similarity. The good thing is if you normalize your vectors then cosine (and even Euclidean) is equivalent to dot product (up to order). Therefore, reducing the quantization error in the dot product covers the great majority of use cases. This will be our focus.
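To see this equivalence concretely, here is a two-line NumPy check that cosine similarity equals the dot product once the vectors are normalized to unit length:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = rng.normal(size=8)

# After normalizing to unit length, cosine similarity and dot product coincide.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.allclose(np.dot(xn, yn), cosine))
```

For Euclidean distance on unit vectors, $\|x-y\|^2 = 2 - 2x^ty$, so ranking by distance is equivalent (in reverse order) to ranking by dot product.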
The vector database use case is as follows. There is a large collection of floating point vectors from some black box embedding model. We want a quantization scheme which achieves the best possible recall when retrieving with query vectors which come from a similar distribution. Recall is the proportion of true nearest neighbors, those with maximum dot product computed with float vectors, which we retrieve computing similarity with quantized vectors. We assume we've been given the truncation interval $[a,b]$ for now. All vector components are snapped into this interval in order to compute their quantized values exactly as we discussed before. In the next part we will discuss how to optimize this interval for the vector database use case.
To motivate the following analysis, consider that for any given document vector, if we knew the query vector ahead of time, we could compute the quantization error in the dot product exactly and simply subtract it. Clearly, this is not realistic since, apart from anything else, the query vector is not fixed. However, maybe we can achieve a real improvement by assuming that the query vector is drawn from a distribution that is centered around the document vector. This is plausible since queries that match a document are likely to be in the vicinity of its embedding. In the following, we formalize this intuition and derive an actual correction term. We first study the error that scalar quantization introduces into the dot product. We then devise a correction based on the expected first order error in the vicinity of each indexed vector. Doing this requires us to store one extra float per vector. Since realistic vector dimensions are large, this results in minimal overhead.
We will call an arbitrary vector in our database $\mathbf{x}$ and an arbitrary query vector $\mathbf{y}$. Then
\begin{align*}
\mathbf{x}^t\mathbf{y} &= (\mathbf{a}+\mathbf{y}-\mathbf{a})^t(\mathbf{a}+\mathbf{x} - \mathbf{a}) \\
&= \mathbf{a}^t\mathbf{a}+\mathbf{a}^t(\mathbf{y}-\mathbf{a})+\mathbf{a}^t(\mathbf{x}-\mathbf{a})+(\mathbf{y}-\mathbf{a})^t(\mathbf{x}-\mathbf{a})
\end{align*}
On the right hand side, the first term is a constant and the next two terms are each a function of a single vector and can be precomputed. For the one involving the document, this is an extra float that can be stored with its vector representation. So far all our calculations can use floating point arithmetic. Everything interesting, however, is happening in the last term, which depends on the interaction between the query and the document. We just need one more bit of notation: define $\alpha = \frac{b-a}{2^b-1}$ and $\star_q=\left\lfloor\frac{\text{clamp}(\star, a, b)-a}{\alpha}\right\rceil$ where we understand that clamping and rounding broadcast over vector components. Let's rewrite the last term, still keeping everything in floating point, using a similar trick:
\begin{align*}
(\mathbf{y}-\mathbf{a})^t(\mathbf{x}-\mathbf{a}) &= (\alpha \mathbf{y}_q + \mathbf{y} -\mathbf{a} - \alpha\mathbf{y}_q)^t(\alpha \mathbf{x}_q + \mathbf{x} -\mathbf{a} - \alpha\mathbf{x}_q) \\
&= \alpha^2\mathbf{y}_q^t\mathbf{x}_q + \alpha\mathbf{y}_q^t\mathbf{\epsilon}_x + \alpha\mathbf{x}_q^t\mathbf{\epsilon}_y + \text{O}(\|\mathbf{\epsilon}\|^2)
\end{align*}
Here, $\mathbf{\epsilon}_{\star} = \star - \mathbf{a} - \alpha\star_q$ represents the quantization error. The first term is just the scaled quantized vector dot product and can be computed exactly. The last term is proportional in magnitude to the square of the quantization error, which we hope will be somewhat small compared to the overall dot product. That leaves us with the terms that are linear in the quantization error.
We can compute the quantization error vectors in the query and document, $\mathbf{\epsilon}_y$ and $\mathbf{\epsilon}_x$ respectively, ahead of time. However, we don't actually know the value of $\mathbf{x}$ we will be comparing to a given $\mathbf{y}$ and vice versa. So we don't know how to calculate the error in the dot product quantities exactly. In such cases it is natural to try and minimize the error in expectation (in some sense we discuss below).
If $\mathbf{x}$ and $\mathbf{y}$ are drawn at random from our corpus they are random variables and so too are $\mathbf{x}_q$ and $\mathbf{y}_q$. For any distribution we average over for $\mathbf{x}_q$ then $\mathbb{E}_x[\alpha\mathbf{x}_q^t \mathbf{\epsilon}_y]=\alpha\mathbb{E}_x[\mathbf{x}_q]^t \mathbf{\epsilon}_y$, since $\alpha$ and $\mathbf{\epsilon}_y$ are fixed for a query. This is a constant additive term to the score of each document, which means it does not change their order. This is important as it will not change the quality of retrieval and so we can drop it altogether.
What about the $\alpha\mathbf{y}_q^t \mathbf{\epsilon}_x$ term? The naive thing is to assume that $\mathbf{y}_q$ is a random sample from our corpus. However, this is not the best assumption. In practice, we know that the queries which actually match a document will come from some region in the vicinity of its embedding, as we illustrate in the figure below.
<cite> Schematic query distribution that is expected to match a given document embedding (orange) vs all queries (blue plus orange). </cite>
We can efficiently find nearest neighbors of each document $\mathbf{x}$ in the database once we have a proximity graph. However, we can do something even simpler and assume that for relevant queries $\mathbb{E}_y[\mathbf{y}_q] \approx \mathbf{x}_q$. This yields a scalar correction $\alpha\mathbf{x}_q^t\mathbf{\epsilon}_x$ which only depends on the document embedding and can be precomputed and added on to the $\mathbf{a}^t(\mathbf{x}-\mathbf{a})$ term and stored with the vector. We show later how this affects the quality of retrieval. The anisotropic correction is inspired by this approach for reducing product quantization errors.
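Putting the pieces together, the following sketch (our own notation in code) precomputes the per-document float term, including the $\alpha\mathbf{x}_q^t\mathbf{\epsilon}_x$ correction, and reconstructs the dot product from the integer part plus the stored corrections:

```python
import numpy as np

def quantize(v, a, b, bits=8):
    """Snap components of v into [a, b] and quantize to integers in [0, 2^bits - 1]."""
    alpha = (b - a) / ((1 << bits) - 1)
    return np.rint((np.clip(v, a, b) - a) / alpha), alpha

def doc_side_precompute(x, anchor, a, b, bits=8):
    """Float stored per document at index time: a^t(x - a) plus the
    first-order correction alpha * x_q^t eps_x. `anchor` is the vector
    whose components are all equal to a."""
    xq, alpha = quantize(x, a, b, bits)
    eps_x = x - anchor - alpha * xq          # quantization error vector
    return xq, anchor @ (x - anchor) + alpha * (xq @ eps_x)

def corrected_dot(y, yq, xq, doc_term, anchor, a, b, bits=8):
    """Reconstruct x^t y from the integer dot product plus float corrections."""
    alpha = (b - a) / ((1 << bits) - 1)
    return anchor @ anchor + anchor @ (y - anchor) + doc_term + alpha**2 * (yq @ xq)

rng = np.random.default_rng(1)
d, a, b = 64, -0.5, 0.5
anchor = np.full(d, a)
x = rng.uniform(a, b, d)                      # document vector
y = rng.uniform(a, b, d)                      # query vector
yq, _ = quantize(y, a, b)
xq, doc_term = doc_side_precompute(x, anchor, a, b)
approx = corrected_dot(y, yq, xq, doc_term, anchor, a, b)
print(abs(approx - x @ y))                    # small residual error
```

Note that `yq @ xq` is the only per-pair computation and, in a real implementation, would run as an optimized integer dot product; everything else is precomputed or computed once per query.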
Finally, we note that the main obstacle to improving this correction is we don't have useful estimates of the joint distribution of the query and document embedding quantization errors. One approach that might enable this, at the cost of some extra memory and compute, is to use low rank approximations of these errors. We plan to study schemes like this since we believe they could unlock accurate general purpose 2 or even 1 bit scalar quantization.
So far we worked with some specified interval $[a,b]$ but didn't discuss how to compute it. In the context of quantization for model inference people tend to use quantile points of the component distribution or their minimum and maximum. Here, we discuss a new method for computing this based on the idea that preserving the order of the dot product values is better suited to the vector database use case.
First off, why is this not equivalent to minimizing the magnitude of the quantization errors? Suppose for a query $\mathbf{y}$ the top-k matches are $\mathbf{x}_i$ for $i \in [k]$. Consider two possibilities: the quantization error is some constant $c$, or it is normally distributed with mean $0$ and standard deviation $\frac{c}{10}$. In the second case the expected error is roughly 10 times smaller than the first. However, the first effect is a constant shift, which preserves order and has no impact on recall. Meanwhile, if $\frac{1}{k}\sum_{i=1}^k \left|\mathbf{y}^t(\mathbf{x}_i-\mathbf{x}_{i+1})\right| < \frac{c}{10}$ it is very likely the random error will reorder matches and so affect the quality of retrieval.
Let's use the previous example to better motivate our approach. The figure below shows the various quantities at play for a sample query $\mathbf{y}$ and two documents $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$. The area of each blue shaded rectangle is equal to one of the floating point dot products and the area of each red shaded rectangle is equal to one of the quantization errors. Specifically, the dot products are $\|\mathbf{y}\|\|P_y\mathbf{x}_i\|$ and $\|\mathbf{y}\|\|P_y\mathbf{x}_{i+1}\|$, and the quantization errors are $\|\mathbf{y}\|\|P_y(\mathbf{x}_i-\mathbf{a}-\alpha\mathbf{x}_{i,q})\|$ and $\|\mathbf{y}\|\|P_y(\mathbf{x}_{i+1}-\mathbf{a}-\alpha\mathbf{x}_{i+1,q})\|$, where $P_y=\frac{\mathbf{y}\mathbf{y}^t}{\|\mathbf{y}\|^2}$ is the projection onto the query vector. In this example the errors preserve the document order. This follows because the right blue rectangle (representing the exact dot product) and the union of the right blue and red rectangles (representing the quantized dot product) are both larger than the left ones. It is visually clear that the more similar the left and right red rectangles, the less likely it is the documents will be reordered. Conversely, the more similar the left and right blue rectangles, the more likely it is that quantization will reorder them.
<cite> Schematic of the dot product values and quantization error values for a query and two near neighbor documents. In this case, the document order is preserved by quantization. </cite>
One way to think of the quantized dot product is that it models the floating point dot product. From the previous discussion we want to minimize the variance of this model's residual error: the error should be as similar as possible from one document to the next. However, there is a second consideration: the density of the floating point dot product values. If these values are close together it is much more likely that quantization will reorder them. It is quite possible for this density to change from one part of the embedding space to another, and higher density regions are more sensitive to quantization errors.
A natural measure which captures both the quantization error variance and the density of the dot product values is the coefficient of determination of the quantized dot product with respect to the floating point dot product. A good interval $[a,b]$ will maximize this in expectation over a representative query distribution. We need a reasonable estimator for this quantity for the database as a whole that we can compute efficiently. We found the following recipe is both fast and yields an excellent choice for parameters $a$ and $b$:
This optimization problem can be solved by any black box solver. For example, we used a variant of the adaptive LIPO algorithm in the following. Furthermore, we found that our optimization objective was well behaved (low Lipschitz constant) for all data sets we tested.
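As an illustration, here is a miniature version of the procedure, with a coarse grid search standing in for the adaptive LIPO solver and synthetic vectors standing in for real embeddings:

```python
import numpy as np

def r2_for_interval(a, b, docs, queries, bits=4):
    """Coefficient of determination (R^2) of the quantized dot product
    as a model of the float dot product over sampled (query, doc) pairs."""
    alpha = (b - a) / ((1 << bits) - 1)
    deq_docs = a + alpha * np.rint((np.clip(docs, a, b) - a) / alpha)
    deq_queries = a + alpha * np.rint((np.clip(queries, a, b) - a) / alpha)
    truth = (queries * docs).sum(axis=1)
    pred = (deq_queries * deq_docs).sum(axis=1)
    ss_res = ((truth - pred) ** 2).sum()
    ss_tot = ((truth - truth.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(2)
docs = rng.normal(0, 0.2, size=(256, 64))
queries = docs + rng.normal(0, 0.05, size=(256, 64))  # queries near their docs

# Coarse grid search over candidate intervals [lo, hi].
best = max(
    ((lo, hi, r2_for_interval(lo, hi, docs, queries))
     for lo in np.linspace(-1.0, -0.2, 5)
     for hi in np.linspace(0.2, 1.0, 5)),
    key=lambda t: t[2],
)
print(best)
```

The best interval trades off clipping error (interval too narrow) against rounding error (interval too wide), exactly the tension the $R^2$ objective captures.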
Before deciding to implement this scheme for real we studied how it behaves with int4 quantization.
Below we show results for two data sets that are fairly typical of passage embedding model distributions on real data. To generate these we use e5-small-v2 and Cohere's multilingual-22-12 models. These are both fairly state-of-the-art text embedding models. However, they have rather different characteristics. The e5-small-v2 model uses cosine similarity, its vectors have 384 dimensions and very low angular variation. The multilingual-22-12 model uses dot product similarity, its vectors have 768 dimensions and it encodes information in their length. They pose rather different challenges for our quantization scheme and improving both gives much more confidence it works generally. For e5-small-v2 we embedded around 500K passages and 10K queries sampled from the MS MARCO passage data set. For multilingual-22-12 we used around 1M passages and 1K distinct passages for queries sampled from the English Wikipedia data set.
First of all, it is interesting to understand the accuracy of the int4 dot product values. The figure below shows the int4 dot product values compared to their float values for a random sample of 100 documents and their 10 nearest neighbors taken from the set we use to compute the optimal truncation interval for e5-small-v2. The orange “best fit” line is $y=x-0.017$. Note that this underlines the fact that this procedure can pick a biased estimator if it reduces the residual variance: in this case the quantized dot product is systematically underestimating the true dot product. However, as we discussed before, any constant shift in the dot product is irrelevant for ranking. For the full 1k samples we achieve an $R^2$ of a little less than 0.995, i.e. the int4 quantized dot product is a very good model of the float dot product!
<cite> Comparison of int4 dot product values to the corresponding float values for a random sample of 100 documents and their 10 nearest neighbors. </cite>
While this is reassuring, what we really care about is the impact on retrieval quality. Since brute force nearest neighbor search can be implemented in a few lines of code, it lets us quickly test the impact of our design choices on retrieval. In particular, we are interested in understanding the expected proportion of true nearest neighbors we retrieve when computing similarities with int4 quantized vectors. Below we show the results of an ablation study of the dot product correction and interval optimization.
In general, one can boost the accuracy of any quantization scheme by gathering more than the requested number of vectors and reranking them using their floating point values. However, this comes at a cost: searching graphs for more matches is significantly more expensive, and the floating point vectors must be loaded from disk (keeping them in memory would defeat the purpose of compressing them). One way of comparing alternatives is therefore to ask how many vectors must be reranked to achieve the same recall; the lower this number, the better. In the figures below we show average recall curves as a function of the number of candidates we rerank for different combinations of the two improvements we have discussed:
Note that we used $d$ to denote the vector dimension.
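The rerank-to-recall methodology can be sketched in a few lines. All names here are hypothetical helpers, not our benchmark code: take the top-m candidates under the quantized score, rerank them with exact float scores, and report what fraction of the true top-k survives.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of measuring recall@k after reranking the top-m quantized candidates
// with exact float scores (hypothetical helper, not our benchmark code).
public class RerankRecall {
    // Indices of the m largest scores, in descending score order.
    public static int[] topM(float[] scores, int m) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> -scores[i]));
        int[] top = new int[m];
        for (int i = 0; i < m; i++) top[i] = idx[i];
        return top;
    }

    // Fraction of the true top-k (by float score) recovered by gathering
    // the top-m candidates under the quantized score.
    public static double recallAtK(float[] floatScores, float[] quantScores, int k, int m) {
        int[] truth = topM(floatScores, k);      // true nearest neighbors
        int[] candidates = topM(quantScores, m); // candidates to rerank with float scores
        int hits = 0;
        for (int t : truth) {
            for (int c : candidates) {
                if (t == c) { hits++; break; }
            }
        }
        return (double) hits / k;
    }
}
```

Sweeping m and plotting recallAtK produces exactly the kind of recall curves shown in the figures below.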
<cite> Average recall@10 curves for e5-small-v2 embeddings as a function of the number of candidates reranked for different combinations of the two improvements we discussed. </cite>
<cite> Average recall@10 curves for multilingual-22-12 embeddings as a function of the number of candidates reranked for different combinations of the two improvements we discussed. </cite>
For e5-small-v2 embeddings we roughly halve the number of vectors we need to rerank to achieve 95% recall compared to the baseline. For multilingual-22-12 embeddings we reduce it by closer to a factor of three. Interestingly, the impact of the two improvements is different for the different data sets. For e5-small-v2 embeddings applying the linear correction has a significantly larger effect than optimizing the interval $[a,b]$ whilst the converse is true for multilingual-22-12 embeddings. Another important observation is the gains are more significant if one wants to achieve very high recall: to achieve close to 99% recall one has to rerank at least 5 times as many vectors for both data sets in the baseline versus our improved quantization scheme.
We have discussed the theoretical and empirical motivation behind two novelties we introduced to achieve high quality int4 quantization, as well as some preliminary results that indicate it will be an effective general purpose scheme for in-memory vector storage for retrieval. This is all well and good, but how well does it work in a real vector database implementation? In the companion blog we discuss our implementation in Lucene and compare it to other storage options, such as floating point and int7, which Lucene also provides.
In our previous blogs, we walked through the implementation of scalar quantization as a whole in Lucene. We also explored two specific optimizations for quantization. Now we've reached the question: how does int4 quantization work in Lucene and how does it line up?
Int4 quantization in Lucene

Lucene stores all the vectors in a flat file, making it possible for each vector to be retrieved given some ordinal. You can read a brief overview of this in our previous scalar quantization blog.
Now int4 gives us additional compression options beyond what we had before. It reduces the quantization space to only 16 possible values (0 through 15). For more compact storage, Lucene uses simple bit shift operations to pack two of these smaller values into a single byte, allowing a possible 2x space savings on top of the 4x space savings already achieved with int8. In all, storing int4 with bit compression is 8x smaller than float32.
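The bit shift packing idea is simple enough to show directly. Here is a sketch of packing two int4 values (0..15) into one byte and recovering them; the class name is illustrative and this mirrors the idea rather than Lucene's exact layout.

```java
// Sketch of packing two int4 values (0..15) into one byte with simple
// bit shifts (illustrative, not Lucene's exact layout).
public class NibblePack {
    public static byte pack(int hi, int lo) {
        return (byte) ((hi << 4) | (lo & 0x0F));
    }
    public static int high(byte b) {
        return (b >> 4) & 0x0F; // mask after the shift to undo sign extension
    }
    public static int low(byte b) {
        return b & 0x0F;
    }
}
```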
<cite> Figure 1: The reduction in bytes required with int4, which allows an 8x reduction in size from float32 when compressed. </cite>
int4 also has some benefits when it comes to scoring latency. Since the values are known to be between 0 and 15, we know exactly when to worry about value overflow and can optimize the dot-product calculation. The maximum value of a single multiplication is 15*15=225, which fits in a single (unsigned) byte. ARM processors (like my MacBook's) have a SIMD instruction width of 128 bits (16 bytes). This means that for Java short values we can fill 8 lanes. For 1024 dimensions, each lane will end up accumulating a total of 1024/8=128 multiplications, each with a maximum value of 225. The resulting maximum sum of 128*225=28800 fits well within the limit of Java's short, and we can iterate over more values at a time than we could with wider accumulators. Here is some simplified code of what this looks like for ARM.
// snip preamble handling vectors longer than 1024
// 8 lanes of 2 bytes
ShortVector acc = ShortVector.zero(ShortVector.SPECIES_128);
for (int i = 0; i < length; i += ByteVector.SPECIES_64.length()) {
// Get 8 bytes from vector a
ByteVector va8 = ByteVector.fromArray(ByteVector.SPECIES_64, a, i);
// Get 8 bytes from vector b
ByteVector vb8 = ByteVector.fromArray(ByteVector.SPECIES_64, b, i);
// Multiply together, potentially saturating signed byte with a max of 225
ByteVector prod8 = va8.mul(vb8);
// Now convert the product to accumulate into the short
ShortVector prod16 = prod8.convertShape(B2S, ShortVector.SPECIES_128, 0).reinterpretAsShorts();
// Ensure to handle potential byte saturation
acc = acc.add(prod16.and((short) 0xFF));
}
// snip, tail handling
For a more detailed explanation of the error correction calculation and its derivation, please see error correcting the scalar dot-product.
Here is a short summary, woefully (or joyfully) devoid of complicated mathematics.
For every quantized vector stored, we additionally keep track of a quantization error correction. Back in the Scalar Quantization 101 blog there was a particular constant mentioned:
\alpha \times int8_i \times min
This term is a constant derived from basic algebra. However, we now include additional information in the stored float that relates to the rounding loss:
\sum_{i=0}^{dim-1} ((i - min) - i'\times\alpha)i'\times\alpha
Where $$i$$ is each floating point vector dimension value, $$i'$$ is its scalar quantized value, and $$\alpha=\frac{max - min}{(1 << bits) - 1}$$.
This has two consequences. The first is intuitive, as it means that for a given set of quantization buckets, we are slightly more accurate as we account for some of the lossiness of the quantization. The second consequence is a bit more nuanced. It now means we have an error correction measure that is impacted by the quantization bucketing. This implies that it can be optimized.
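The correction term above can be computed in a single pass at quantization time. Here is a sketch with illustrative names, not Lucene's implementation, following the formula $\sum_i ((i - min) - i'\alpha)\,i'\alpha$:

```java
// Sketch of the per-vector rounding-loss correction described above:
// sum over dimensions of ((x_i - min) - q_i*alpha) * q_i*alpha, where
// q_i is the stored quantized value (names are illustrative).
public class QuantCorrection {
    public static double correction(float[] v, int[] q, float min, float max, int bits) {
        double alpha = (max - min) / ((1 << bits) - 1);
        double corr = 0;
        for (int i = 0; i < v.length; i++) {
            double dequant = q[i] * alpha;               // stored value mapped back to float scale
            corr += ((v[i] - min) - dequant) * dequant;  // rounding loss weighted by stored value
        }
        return corr;
    }
}
```

When a value lands exactly on a quantization level the contribution is zero; the correction only accumulates where rounding actually lost information.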
The naive and simple way to do scalar quantization can get you pretty far. Usually, you pick a confidence interval from which you calculate the allowed extreme boundaries for vector values. The default in Lucene and consequently Elasticsearch is $$1-1/(dimensions+1)$$. Figure 2 shows the confidence interval over some sampled CohereV3 embeddings. Figure 3 shows the same vectors, but scalar quantized with that statically set confidence interval.
<cite> Figure 2: A sampling of CohereV3 dimension values. </cite>
<cite> Figure 3: CohereV3 dimension values quantized into int7 values. What are those spikes at the end? Well, that is the result of truncating extreme values during the quantization process. </cite>
But we are leaving some nice optimizations on the floor. What if we could tweak the confidence interval to shift the buckets, allowing more important dimensional values to have higher fidelity? To optimize, Lucene searches a space of candidate quantiles: for example, a vector with 1024 dimensions would search quantile candidates between confidence intervals 0.99902 and 0.90009.
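Conceptually, this is a grid search over candidate truncation intervals. Below is a simplified sketch, assuming candidates are taken from sample quantiles and scored by squared reconstruction error; Lucene's actual criterion and search space differ, and all names are illustrative.

```java
import java.util.Arrays;

// Simplified sketch of optimizing the truncation interval by grid search:
// evaluate candidate intervals taken from sample quantiles and keep the
// one with the lowest squared reconstruction error (illustrative only).
public class IntervalSearch {
    public static double reconstructionError(float[] sample, float a, float b) {
        double alpha = (b - a) / 15.0; // 16 levels for int4
        double err = 0;
        for (float x : sample) {
            double clipped = Math.min(b, Math.max(a, x));
            double dequant = a + Math.round((clipped - a) / alpha) * alpha;
            err += (x - dequant) * (x - dequant);
        }
        return err;
    }

    // Returns {a, b} minimizing the error over candidate quantile pairs.
    public static double[] best(float[] sample, double[] lowerQuantiles, double[] upperQuantiles) {
        float[] sorted = sample.clone();
        Arrays.sort(sorted);
        double bestErr = Double.POSITIVE_INFINITY;
        double[] best = null;
        for (double lq : lowerQuantiles) {
            for (double uq : upperQuantiles) {
                float a = sorted[(int) (lq * (sorted.length - 1))];
                float b = sorted[(int) (uq * (sorted.length - 1))];
                if (b <= a) continue;
                double e = reconstructionError(sample, a, b);
                if (e < bestErr) {
                    bestErr = e;
                    best = new double[]{a, b};
                }
            }
        }
        return best;
    }
}
```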
<cite> Figure 3: Lucene searching the confidence interval space and testing various buckets for int4 quantization. </cite>
<cite> Figure 4: The best int4 quantization buckets found for this CohereV3 sample set. </cite>
For a more complete explanation of the optimization process and the mathematics behind this optimization, see optimizing the truncation interval.
As I mentioned before, int4 gives you an interesting tradeoff between performance and space. To drive this point home, here are some memory requirements for 500k CohereV3 vectors.
<cite> Figure 5: Memory requirements for CohereV3 500k vectors. </cite>
Of course, we see the typical 4x reduction from regular scalar quantization, but then an additional 2x reduction with int4, moving the required memory from 2GB to less than 300MB. Keep in mind, this is with compression enabled. Decompressing and compressing bytes does add overhead at search time: every byte vector must be decompressed before doing the int4 comparisons. Consequently, when this is introduced in Elasticsearch, we want to give users the ability to choose whether or not to compress. For some users the cheaper memory requirements are too good to pass up; for others the focus is speed. Int4 gives you the opportunity to tune your settings to fit your use case.
<cite> Figure 6: HNSW graph search speed comparison for CohereV3 500k vectors. </cite>
Figure 6 is a bit disappointing in terms of the speed of compressed scalar quantization. We expected performance benefits from loading fewer bytes onto the JVM heap, but these were outweighed by the cost of decompression. This caused us to dig deeper. The reason for the performance impact was that we were naively decompressing the bytes separately from the dot-product comparison. This was a mistake. We can do better.
Consequently, we can use SIMD to decompress the bytes and compare them in the same function. This is a bit more complicated than the previous SIMD example, but it is possible. Here is a simplified version of what this looks like for ARM.
// the packed vector, each byte contains two values
// for packed value `n`: packed[n] = (raw[n] << 4) | raw[packed.length + n];
ByteVector vb8 = ByteVector.fromArray(ByteVector.SPECIES_64, packed, i + j);
// unpacked, the raw query vector int4 quantized
ByteVector va8 = ByteVector.fromArray(ByteVector.SPECIES_64, unpacked, i + j + packed.length);
// first product: the low 4 bits of each packed byte times the second half of the unpacked vector
ByteVector prod8 = vb8.and((byte) 0x0F).mul(va8);
ShortVector prod16 = prod8.convertShape(B2S, ShortVector.SPECIES_128, 0).reinterpretAsShorts();
// Ensure to handle potential byte saturation
acc0 = acc0.add(prod16.and((short) 0xFF));
// second product: the high 4 bits of each packed byte times the first half of the unpacked vector
va8 = ByteVector.fromArray(ByteVector.SPECIES_64, unpacked, i + j);
prod8 = vb8.lanewise(LSHR, 4).mul(va8);
prod16 = prod8.convertShape(B2S, ShortVector.SPECIES_128, 0).reinterpretAsShorts();
// Ensure to handle potential byte saturation
acc1 = acc1.add(prod16.and((short) 0xFF));
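To see what the packed comparison computes, here is a scalar (non-SIMD) reference version of the same dot product; the class name is illustrative and this is a sketch, not Lucene code.

```java
// Scalar (non-SIMD) reference for the packed int4 dot product above: each
// packed byte carries two 4-bit values, multiplied against the matching
// halves of the unpacked query vector (a sketch, not Lucene code).
public class PackedDot {
    // Layout as in the comment above: packed[n] = (raw[n] << 4) | raw[packed.length + n]
    public static int dot(byte[] packed, byte[] unpacked) {
        int acc = 0;
        for (int n = 0; n < packed.length; n++) {
            int hi = (packed[n] >> 4) & 0x0F; // raw[n]
            int lo = packed[n] & 0x0F;        // raw[packed.length + n]
            acc += hi * unpacked[n] + lo * unpacked[packed.length + n];
        }
        return acc;
    }
}
```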
As expected, this yields a significant improvement on ARM, effectively removing the performance discrepancy between compressed and uncompressed scalar quantization.
<cite> Figure 7: HNSW graph search comparison with int4 quantized vectors over 500k Coherev3 vectors. This is on ARM architecture. </cite>
Over the last two large and technical blog posts, we've gone over the math and intuition around the optimizations and what they bring to Lucene. It's been a long ride, and we are nowhere near done. Keep an eye out for these capabilities in a future Elasticsearch version!
As we have described before, Lucene's, and hence Elasticsearch's, approximate kNN search is based on searching an HNSW graph for each index segment and combining results from all segments to find the global k nearest neighbors. When it was first introduced, multi-graph search was done sequentially in a single thread, searching one segment after another. This comes with some performance penalty, because searching a single graph is sublinear in its size. In Elasticsearch 8.10 we parallelized vector search, allocating up to one thread per segment in kNN vector searches, if there are sufficient available threads in the threadpool. Thanks to this change, we saw query latency drop to half its previous value in our nightly benchmark.
Even though we were searching segment graphs in parallel, they were still independent searches, each collecting its own top-k results unaware of the progress made by other segment searches. We knew from our experience with lexical search that we could achieve significant speedups by exchanging information about the best results collected so far among segment searches, and we thought we could apply the same idea to vector search.
When graph based indexes such as HNSW search for nearest neighbors to a query vector one can think of their strategy as a combination of exploration and exploitation. In the case of HNSW this is managed by gathering a larger top-n match set than the top-k which it will eventually return. The search traverses every edge whose end vector is competitive with the worst match found so far in the expanded set. This means it explores parts of the graph which it already knows are not competitive and will never be returned. However, it also allows the search to escape local minima and ultimately achieve better recall. By contrast a pure exploitation approach simply seeks to decrease the distance to the kth best match at every iteration and will only traverse edges whose end vectors will be added to the current top-k set.
So the size of the expanded match set is a hyperparameter which allows one to trade run time for recall by increasing or decreasing exploration in the proximity graph.
As we discussed already, Lucene builds multiple graphs for different partitions of the data. Furthermore, at large scale, data must be partitioned and separate graphs built if one wants to scale retrieval horizontally over several machines. Therefore, a generally interesting question is "how should one adapt this strategy in the case that several graphs are being searched simultaneously for nearest neighbors?"
Recall is significantly higher when one searches graphs independently and combines the top-k sets from each. This makes sense through the lens of exploration vs exploitation: the multi-graph search is exploring many more vectors and so is much less likely to get trapped in local minima for the similarity function. However, it pays a cost to do this in increased run time. Ideally, we would like recall to be more independent of the sharding strategy and search to be faster.
There are two factors which impact the efficiency of search on multiple graphs vs a single graph: the edges which are present in the single graph and having multiple independent top-n sets. In general, unless the vectors are partitioned into disjoint regions the neighbors of a vector in each partition graph will only comprise a subset of the true nearest neighbors in the single graph. This means one pays a cost in exploring non-competitive neighbors when searching multiple graphs. Since graphs are built independently, one necessarily has to pay a “structural” cost from having several graphs. However, as we shall see we can mitigate the second cost by intelligently sharing state between searches.
Given a shared global top-n set it is natural to ask how we should search portions of graphs that are uncompetitive, specifically, edges whose end vertices are further from the query than the current nth best global match. If we were searching a single graph, these edges would not be traversed. However, we have to bear in mind that the different searches have different entry points and progress at different rates, so if we apply the same condition to multi-graph search, it is possible that a search will stop altogether before visiting its closest neighbors to the query. We illustrate this in the figure below.
<cite> Figure 1 Two graph fragments showing a snapshot of a simultaneous search gathering the top-2 set. In this case if we were to prune edges whose unvisited end vertices are not globally competitive we would never traverse the red dashed edge and fail to find the best matches which are all in Graph 2. </cite>
To avoid this issue we devised a simple approach that effectively switches between different parameterizations of each local search based on whether it is globally competitive or not. To achieve this, as well as the global queue which is synchronized periodically, we maintain two local queues of the distances to the closest vectors to the query found for the local graph. One has size n and the other has size $\lfloor g \times n \rfloor$. Here, $g$ controls the greediness of non-competitive search and is some number less than 1. In effect, $g$ is a free parameter we can use to control recall vs the speed up.
As the search progresses we check two conditions when deciding whether to traverse an edge: i) would we traverse the edge if we were searching the graph alone, ii) is the end vertex globally competitive or is it locally competitive with the "greedy" best match set. Formally, if we denote the query vector $q$, the end vector of the candidate edge $v_e$, the $n^{\text{th}}$ local best match $v_n$, the $\lfloor g \times n\rfloor^{\text{th}}$ local best match $v_g$ and the $n^{\text{th}}$ global best match $v_{gb}$ then this amounts to adding $v_e$ to the search set if
d(v_e, q) < d(v_n, q)\text{ AND }(d(v_e, q) < d(v_g, q)\text{ OR }d(v_e, q) < d(v_{gb}, q))
Here, $d(\cdot,\cdot)$ denotes the index distance metric. Note that this strategy ensures we always continue searching each graph to any local minimum and depending on the choice of $g$ we still escape some local minima.
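The traversal test can be expressed as a tiny predicate. Here is a sketch with illustrative names (smaller distance means a better match):

```java
// Sketch of the traversal condition above: expand a candidate edge only if
// its end vector beats the local top-n, AND it is competitive either with
// the greedy floor(g*n) local set or with the global top-n set.
public class MultiGraphGate {
    public static boolean shouldTraverse(double dCandidate, double dLocalNth,
                                         double dGreedyNth, double dGlobalNth) {
        return dCandidate < dLocalNth
            && (dCandidate < dGreedyNth || dCandidate < dGlobalNth);
    }
}
```

The second clause is what lets a locally promising but globally uncompetitive search continue toward its local minimum instead of stopping prematurely.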
Modulo some details around synchronization, initialization and so on, this describes the change to the search. As we show, this simple approach yields very significant improvements in search latency, with recall that is closer to, but still higher than, that of single graph search.
Our nightly benchmarks showed up to 60% faster vector search queries running concurrently with indexing (average query latencies dropped from 54 ms to 32 ms).
<cite> Figure 2 Query latencies that run concurrently with indexing dropped significantly after upgrading to Lucene 9.10, which contains the new changes. </cite>
On queries that run outside of indexing we observed more modest speedups, mostly because the dataset is not that big, containing 2 million vectors of 96 dims across 2 shards (Figure 3). Still, for those benchmarks we could see a significant decrease in the number of visited vertices in the graph and hence the number of vector operations (Figure 4).
<cite> Figure 3 Whilst we see small drops in the latencies after the change for queries that run without concurrent indexing, particularly for retrieving the top-100 matches, the number of vector operations (Figure 4) is dramatically reduced. </cite>
<cite> Figure 4 We see very significant decreases in the number of vector operations used to retrieve the top-10 and top-100 matches. </cite>
The speedups should be clearer for larger indexes with higher dimension vectors: in testing we typically saw between $2\times$ and $3\times$, which is also consistent with the reduction in the number of vector comparisons we see above. For example, we show below the speedup in vector search operations on the Lucene nightly benchmarks, which use vectors of 768 dimensions. It is worth noting that in the Lucene benchmarks vector search runs in a single thread, sequentially processing one graph after another, but the change positively affects this case as well. This happens because the global top-n set collected after the first graph search sets the threshold for subsequent graph searches and allows them to finish earlier if they don't contain competitive candidates.
<cite> Figure 5 The graph shows that with the change committed on Feb 7th, the number of queries per second increased from 104 queries/sec to 219 queries/sec. </cite>
The multi-graph search speedups come at the expense of slightly reduced recall. This happens because we may stop exploration of a graph that may still have good matches based on the global matches from other graphs. Two notes on the reduced recall: i) From our experimental results we saw that the recall is still higher than the recall of a single graph search, as if all segments were merged together into a single graph (Figure 6). ii) Our new approach achieves better performance for the same recall: it Pareto dominates our old multi-graph search strategy (Figure 7).
<cite> Figure 6 We can see the recall of kNN search on multiple segments slightly dropped for both top-10 and top-100 matches, but in both cases it is still higher than the recall of kNN search on a single merged segment. </cite>
<cite> Figure 7 The Queries Per Second is better in the candidate (with the current changes) than the baseline (old multi-graph search strategy) for the 10 million documents of the Cohere/wikipedia-22-12-en-embeddings dataset for each equivalent recall. </cite>
In this blog we showed how we achieved significant improvements in Lucene vector search performance, while still achieving excellent recall, by intelligently sharing information between the different graph searches. The improvement is part of the Lucene 9.10 release and of the Elasticsearch 8.13 release.
We're not done yet with improvements to our handling of multiple graphs in Lucene. As well as further improvements to search, we believe we've found a path to achieve dramatically faster merge times. So stay tuned!
All these experiments and discoveries invariably revolve around measuring the model's performance on a given task. Unfortunately, assessing the quality of generated text poses a significant challenge, given its inherent openness and permissiveness. In a “search” scenario, there exists an “ideal” document ranking, which allows for straightforward comparisons to gauge how closely one aligns with this ideal ranking. However, when it comes to evaluating the quality of generated text in terms of answering questions or summarizing content, the task becomes considerably more complex.
In this blog, our primary focus will be on RAG (Retrieval-Augmented Generation) question answering tasks and more specifically on closed-domain QA. We will delve into various metrics commonly employed in this field and explain the decisions made at Elastic to effectively monitor model performance.
In this family of metrics, the idea is to check how similar the generated text is to the “ground truth”. There are many variations on this idea, and we will only discuss a few of them.
While these metrics serve as valuable tools for quick and straightforward evaluation of LLMs, they have certain limitations that render them less than ideal. To begin with, they fall short when it comes to assessing the fluency, coherence, and overall meaning of passages. They are also relatively insensitive to word order. Furthermore, despite METEOR's attempts to address this issue through synonyms and stemming, these evaluation tools lack semantic knowledge, making them blind to semantic variations. This problem is particularly acute in assessing long texts effectively, as treating text as a mere bag of passages is overly simplistic. Additionally, reliance on 'template answers' makes them expensive to use in large-scale evaluations and introduces bias toward the exact phraseology used in the template. Lastly, for specific tasks, studies have revealed that the correlation between BLEU and ROUGE scores and human judgments is actually quite low. For these reasons, researchers have tried to find improved metrics.
Perplexity (often abbreviated as PPL) stands as one of the most common metrics for assessing large language models (LLMs). Calculating perplexity requires access to the probability distribution for each word generated by your model. It is a measure of how confidently the model predicts the sequence of words: the higher the perplexity, the less confidently the model predicts the observed sequence. Formally, it is defined as follows:
PPL_{model}=\exp\left(-\frac{1}{t}\sum_{i=1}^t \log P_{\theta}(x_i \mid x_{j\neq i})\right)
where $\log P_{\theta}(x_i \mid x_{j\neq i})$ is the log predicted probability of the $i^{th}$ token conditioned on the other tokens in the sentence according to the model, and $t$ is the number of tokens. To illustrate, below is an example of how perplexity is computed for a model with a vocabulary of just three words.
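The computation itself is short. Here is a sketch assuming we already have the model's per-token predicted probabilities (the helper name is illustrative, not any particular library's API):

```java
// Sketch of computing perplexity from per-token predicted probabilities:
// the exponential of the average negative log-probability.
public class Perplexity {
    public static double of(double[] tokenProbs) {
        double sumLog = 0;
        for (double p : tokenProbs) {
            sumLog += Math.log(p);
        }
        return Math.exp(-sumLog / tokenProbs.length);
    }
}
```

For instance, if the model assigns every token a probability of 1/3 over a three-word vocabulary, the perplexity is 3: the model is as uncertain as a uniform guess.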
<p align = "center">Fig.2 - Perplexity score example</p>
A significant benefit of perplexity is that it is fast to compute, because it relies solely on output probabilities and doesn't involve an external model. Additionally, it tends to correlate strongly with the quality of a model (although this correlation may vary depending on the test dataset being used).
Nonetheless, perplexity comes with certain limitations that can pose challenges. Firstly, it relies on the information density of the model, making it difficult to compare two models that differ in vocabulary size or context length. It is also impossible to compare scores between datasets, as some evaluation data may have intrinsically higher perplexity than others. Moreover, it can be overly sensitive to vocabulary discrepancies, potentially penalizing models for expressing the same answer differently, even when both versions are valid. Lastly, perplexity isn't well-suited for evaluating a model's ability to handle language ambiguity, creativity, or hallucinations. On ambiguity especially, words which are poorly determined by the rest of the sequence push up perplexity, but they aren’t indicators of poor generation or understanding. It could potentially penalize a model which understands ambiguity better than a less capable one. Due to these shortcomings, the NLP community has explored more advanced extrinsic metrics to address these issues.
Intrinsic and N-gram metrics have a significant drawback in that they don't leverage semantic understanding to evaluate the accuracy of generated content. Consequently, they may not align as closely with human judgement as we want. Model-based metrics have emerged as a more promising solution to tackle this issue.
BERTScore and BLEURT can essentially be seen as n-gram recalls using contextual representations. BARTScore, on the other hand, is closer to a perplexity measurement between a target and a generated text, using a critic model rather than the model itself. While these model-based metrics offer strong evaluation capabilities, they are slower than BLEU or PPL because they involve external models. The relatively low correlation between BLEU and human judgement in many generation contexts means this trade-off is justified. Simple similarity-based metrics are still popular for selecting LLMs (as seen in the Hugging Face leaderboard). This approach might serve as a reasonable proxy, but given the capabilities of current state-of-the-art LLMs it is not sufficient.
UniEval unifies all evaluation dimensions into a Boolean Question Answering framework, allowing a single model to assess a generated text from various angles. For example, if one of the evaluation dimensions is relevance, then one would ask the model directly "Is this a relevant answer to this question ?”. Given a set of tasks determined by the evaluation dimensions, a model is trained which is able to evaluate generated text with respect to those dimensions. Employing T5 as the foundational model, UniEval uses a two-step training process. The initial step, called “intermediate multitask learning”, utilizes the query and context to address multiple tasks unified as boolean QA tasks from pre-existing relevant datasets. Subsequently, the second step entails sequential training, wherein the model learns, dimension by dimension, how to evaluate distinct aspects of the generated text. The pre-trained UniEval model is geared towards summarization, yet we think that RAG Question Answering can be viewed as an aggressive summarization task when avoiding parametric memory for accurate responses. It has been trained across the following dimensions:
While UniEval is quite powerful, it doesn't hold the title of “state-of-the-art” evaluation model as of this writing. It seems that a GPT-based evaluator like G-Eval could potentially show a stronger correlation with human judgment than UniEval (though only when based on GPT-4). However, it's essential to consider the significant cost difference: UniEval is an 800-million-parameter model, whereas GPT-4 is estimated to boast a massive 1.76 trillion parameters. We firmly believe that the slight advantage seen with GPT-4-based G-Eval isn't justified by the substantial increase in cost.
We're just starting to explore UniEval, and we intend to incorporate it into numerous exciting projects involving text generation in the future. However, armed with this evaluation model, we decided to test its capabilities by addressing three specific questions.
Can we easily compare LLMs quality with UniEval?
This is likely the first consideration that comes to mind when you have an evaluation metric. Is it an effective tool for predicting the quality of LLMs out there? We conducted a benchmark test on Mistral-7b-Instruct and Falcon-7b-Instruct to assess how distinguishable these two models are in terms of fluency, consistency, coherence, and relevance. For this benchmarking, we employed 200 queries from 18 datasets, ensuring a diverse range of contexts (including BioASQ, BoolQ, CoQA, CosmosQA, HAGRID, HotpotQA, MSMARCO, MultiSpanQA, NarrativeQA, NewsQA, NQ, PopQA, QuAC, SearchQA, SleepQA, SQuAD, ToolQA, TriviaQA, TruthfulQA). The prompt given to Mistral/Falcon includes both the query and the context containing the information needed to answer the query.
<p align = "center">Fig.5 - UniEval evaluation of Mistral and Falcon with distribution of scores of 3600 queries. Higher score is better.</p>
In this specific instance, it's evident that Mistral outperforms Falcon across all evaluation dimensions, making the decision quite straightforward. However, it may be more challenging in other cases, particularly when deciding between relevance and consistency, both of which are crucial for Question Answering with RAG.
Is “consistency score” correlated with the number of hallucinations generated by a model?
The experiment is straightforward. We gather about 100 queries from the SQuAD 2.0 datasets. Next, we assess a model (specifically Mistral-7B-Instruct-v0.1 in this instance, but it could be any model) using UniEval. Following that, we manually examine and annotate the generated texts that exhibit hallucinations. Afterward, we create a calibration curve to check if the “consistency score” serves as a reliable predictor for the probability of hallucinations. In simpler terms, we're investigating whether the “consistency score” and the number of hallucinations are correlated.
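The binning step of such a calibration check can be sketched as follows (hypothetical helper, not our actual analysis code): group generations into score buckets and compare the empirical hallucination rate per bucket against the score.

```java
// Sketch of a calibration-curve computation: bucket generations by
// consistency score and compute the empirical hallucination rate in
// each bucket (hypothetical helper, not our analysis code).
public class Calibration {
    // scores in [0, 1); hallucinated[i] is the manual annotation for item i.
    public static double[] hallucinationRatePerBin(double[] scores, boolean[] hallucinated, int bins) {
        double[] rate = new double[bins];
        int[] count = new int[bins];
        for (int i = 0; i < scores.length; i++) {
            int b = Math.min(bins - 1, (int) (scores[i] * bins));
            count[b]++;
            if (hallucinated[i]) rate[b]++;
        }
        for (int b = 0; b < bins; b++) {
            if (count[b] > 0) rate[b] /= count[b];
        }
        return rate;
    }
}
```

A well-calibrated consistency score would show the hallucination rate falling steadily as the score bucket increases.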
<p align = "center">Fig.6 - Calibration curve for consistency on hallucination detection</p>
As observed, consistency turns out to be a reliable indicator of the probability of hallucinations, although it's not flawless. We've encountered situations where hallucinations are subtle and challenging to identify. Additionally, the model we tested occasionally provided correct answers that didn't originate from the prompt's context but rather from its parametric memory. In terms of the consistency metric, this resembles a hallucination, even though the answer is accurate. This is why, on average, we detect more hallucinations than the actual true number. It's worth noting that, during certain experiments where we deliberately included misleading prompts, we managed to mislead both the generation process and our evaluation of it. This shows that UniEval is not a silver bullet.
How are decoding strategies impacting evaluation dimensions?
For this experiment, we wanted to compare different ways of decoding information in Falcon-7b-Instruct. We tried several methods on 18 datasets, using 5 queries per dataset (90 queries in total):
According to earlier studies, the most effective method is contrastive decoding. It's important to note that greedy decoding is holding up reasonably well in this context, even though it is recognized as a somewhat constrained strategy. This might be attributed to the focus on short answers (with a maximum of 64 new tokens) or the possibility that UniEval isn't accurately assessing the “diversity” aspect.
In this blog, we aimed to give some insight into the challenges involved in evaluating LLMs, particularly in the context of question answering using RAG. This field is still in its early stages, with numerous papers being published on the subject. While UniEval isn't a cure-all solution, we find it to be a compelling approach that could offer a more precise assessment of how our RAG pipeline performs. This marks the initial step in an ongoing research endeavor here at Elastic. As always, our objective is to enhance the search experience, and we believe that solutions like UniEval, or similar approaches, will contribute to the development of valuable tools for our users.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.
]]>In this final two part blog of our series, we discuss some of the work we did for retrieval and inference performance for the release of version 2 of our Elastic Learned Sparse EncodeR model (ELSER), which we introduced in this previous blog post. In 8.11 we are releasing two versions of the model: one portable version which will run on any hardware and one version which is optimized for the x86 family of architectures. We're still making the deployment process easy though, by defaulting to the most appropriate model for your cluster's hardware.
In this first part we focus on inference performance. In the second part we discuss the ongoing work we're doing to improve retrieval performance. However, first we briefly review the relevance we achieve for BEIR with ELSER v2.
For this release we extended our training data, including around 390k high quality question and answer pairs to our fine tune dataset, and improved the FLOPS regularizer based on insights we discussed in the past. Together these changes gave us a bump in relevance measured with our usual set of BEIR benchmark datasets.
We plan to follow up with a full description of our training data set composition and the innovations we have introduced, such as improvements to cross-encoder distillation and the FLOPS regularizer at a later date. Since this blog post mainly focuses on performance considerations, we simply give the new NDCG@10 for ELSER v2 model in the table below.
<cite> NDCG@10 for BEIR data sets for ELSER v1 and v2 (higher is better). The v2 results use the query pruning method described below </cite>Model inference in the Elastic Stack is run on CPUs. There are two principal factors which affect the latency of transformer model inference: the memory bandwidth needed to load the model weights and the number of arithmetic operations it needs to perform.
ELSER v2 was trained from a BERT base checkpoint. This has just over 100M parameters, which amounts to about 418 MB of storage for the weights using 32 bit floating point precision. For production workloads for our cloud deployments we run inference on Intel Cascade Lake processors. A typical midsize machine would have L1 data, L2 and L3 cache sizes of around 64 KiB, 2 MiB and 33 MiB, respectively. This is clearly much smaller than model weight storage (although the number of weights which are actually used for any given inference is a function of text length). So for a single inference call we get cache misses all the way up to RAM. Halving the weight memory means we halve the memory bandwidth we need to serve an inference call.
Modern processors support wide registers which let one perform the same arithmetic operations in parallel on several pieces of data, so called SIMD instructions. The number of parallel operations one can perform is a function of the size of each piece of data. For example, Intel processors allow one to perform 8 bit integer multiplication in 16 bit wide lanes. This means one gets roughly twice as many operations per cycle for int8 versus float32 multiplication and this is the dominant compute cost in an inference call.
It is therefore clear that significant performance improvements are available if one can perform inference using int8 tensors. The process of achieving this is called quantization. The basic idea is very simple: clip outliers, scale the resulting numbers into the range 0 to 255 and snap them to the nearest integer. Formally, a floating point number $x$ is transformed using $\left\lfloor\frac{255}{u - l}(\text{clamp}(x, l, u) - l)\right\rceil$, where $l$ and $u$ are the clipping bounds. One might imagine that the accuracy lost in this process would significantly reduce the model accuracy. In practice, large transformer models are fairly resilient to the errors this process introduces.
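The clip-scale-snap formula above can be sketched in a few lines. The bounds and inputs here are illustrative:

```python
# Sketch of the quantization formula from the text: clip to [l, u],
# rescale into [0, 255] and round to the nearest integer.

def quantize(x, l, u):
    clamped = max(l, min(x, u))                    # clamp(x, l, u)
    return round(255.0 / (u - l) * (clamped - l))  # scale and snap

def dequantize(q, l, u):
    return q * (u - l) / 255.0 + l                 # approximate inverse

l, u = -1.0, 1.0
for x in [-2.0, -0.5, 0.0, 0.7, 3.0]:
    q = quantize(x, l, u)
    print(x, "->", q, "->", round(dequantize(q, l, u), 4))
```

Note that outliers such as -2.0 and 3.0 are clipped to the interval endpoints, while values inside the interval round-trip with an error of at most half a quantization step.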
There is quite a lot of prior art on model quantization. We do not plan to survey the topic in this blog and will focus instead on the approaches we actually used. For background and insights into quantization we recommend these two papers.
For ELSER v2 we decided to use dynamic quantization of the linear layers. By default this uses per tensor symmetric quantization of activations and weights. Unpacking this, it rescales values to lie in an interval that is symmetric around zero - which makes the conversion slightly more compute efficient - before snapping. Furthermore, it uses one such interval for each tensor. With dynamic quantization the interval for each activation tensor is computed on-the-fly from their maximum absolute value. Since we want our model to perform well in a zero-shot setting, this has the advantage that we don't suffer from any mismatch in the data used to calibrate the model quantization and the corpus where it is used for retrieval.
The maximum absolute weight for each tensor is known in advance, so these can be quantized upfront and stored in int8 format. Furthermore, we note that attention is itself built out of linear layers. Therefore, if the matrix multiplications in linear layers are quantized the majority of the arithmetic operations in the model are performed in int8.
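The scheme above can be illustrated with a toy dot product. This is a sketch of per-tensor symmetric dynamic quantization, not the actual kernel used in the Elastic Stack: weights are quantized to int8 upfront, while the activation scale is computed on-the-fly from the maximum absolute value.

```python
# Sketch of per-tensor symmetric dynamic quantization for a linear layer:
# weights are quantized to int8 once and stored; the activation scale is
# computed per call from the maximum absolute value, as described above.

def quantize_symmetric(values):
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale

def quantized_dot(x, w_q, w_scale):
    x_q, x_scale = quantize_symmetric(x)        # dynamic: computed per call
    acc = sum(a * b for a, b in zip(x_q, w_q))  # int8 x int8, int32 accumulate
    return acc * x_scale * w_scale              # rescale back to float

weights = [0.3, -1.2, 0.05, 0.8]
w_q, w_scale = quantize_symmetric(weights)      # done upfront, stored as int8
activations = [1.5, -0.2, 0.0, 2.4]
approx = quantized_dot(activations, w_q, w_scale)
exact = sum(a * b for a, b in zip(activations, weights))
print(approx, "vs", exact)
```

Because the activation interval is derived from the tensor itself, no calibration dataset is needed, which is exactly the zero-shot advantage discussed above.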
Our first attempt at applying dynamic quantization to every linear layer failed: it resulted in up to 20% loss in NDCG@10 for some of our BEIR benchmark data sets. In such cases, it is always worthwhile investigating hybrid quantization schemes. Specifically, one often finds that certain layers introduce disproportionately large errors when converted to int8. Typically, in such cases one performs layer by layer sensitivity analysis and greedily selects the layers to quantize while the model meets accuracy requirements.
There are many configurable parameters for quantization which relate to exact details of how intervals are constructed and how they are scoped. We found it was sufficient to choose between three approaches for each linear layer for ELSER v2:
There are a variety of tools which can allow one to observe tensor characteristics which are likely to create problems for quantization. However, ultimately what one always cares about is the model accuracy on the task it performs. In our case, we wanted to know how well the quantized model preserves the text representation we use for retrieval, specifically, the document scores. To this end, we quantized each layer in isolation and calculated the score MAPE of a diverse collection of query relevant document pairs. Since this had to be done on CPU and separately for every linear layer we limited this set to a few hundred examples. The figure below shows the performance and error characteristics for each layer; each point shows the percentage speed up in inference (x-axis) and the score MAPE (y-axis) as a result of quantizing just one layer. We run two experiments per layer: per tensor and per channel quantization.
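The sensitivity metric itself is simple. A minimal sketch of the score MAPE computation follows; the score lists are illustrative placeholders, not the actual ELSER evaluation harness:

```python
# Sketch of the layerwise sensitivity measurement: quantize one layer at a
# time and compute the MAPE of relevance scores against the float32 model.

def mape(reference, candidate):
    """Mean absolute percentage error between two score lists."""
    return 100.0 * sum(
        abs(r - c) / abs(r) for r, c in zip(reference, candidate)
    ) / len(reference)

# Scores of (query, relevant document) pairs from the float32 model and
# from a model with a single layer quantized (illustrative values).
float32_scores = [12.1, 8.4, 15.0, 9.7]
quantized_scores = [12.0, 8.6, 14.8, 9.8]
print(f"score MAPE: {mape(float32_scores, quantized_scores):.2f}%")
```

Layers whose quantization produces a large MAPE are candidates to keep in float32 or to switch to per channel quantization.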
<cite> Relevance scores MAPE for layerwise quantization of ELSER v2 </cite>Note that the performance gain is not equal for all layers. The feed forward layers that separate attention blocks use larger intermediate representations so we typically gain more by quantizing their weights. The MLM head computes vocabulary token activations. Its output dimension is the vocabulary size or 30522. This is the outlier on the performance axis; quantizing this layer alone increases throughput by nearly 13%.
Regarding accuracy, we see that quantizing the output of the 10<sup>th</sup> feed forward module in the attention stack has a dramatic impact and many layers have almost no impact on the scores (< 0.5% MAPE). Interestingly, we also found that the MAPE is larger when quantizing higher feed forward layers. This is consistent with the fact that dropping feed forward layers altogether at the bottom of the attention stack has recently been found to be an effective performance accuracy trade off for BERT. In the end, we chose to disable quantization for around 20% of layers and use per channel quantization for around 15% of layers. This gave us a 0.1% reduction in average NDCG@10 across the BEIR suite and a 2.5% reduction in the worst case.
So what does this yield in terms of performance improvements in the end? Firstly, the model size shrank by a little less than 40%, from 418 MB to 263 MB. Secondly, inference sped up by between 40% and 100% depending on the text length. The figure below shows the inference latency on the left axis for the float32 and hybrid int8 model as a function of the input text length. This was calculated from 1000 different texts ranging from around 200 to 2200 characters (the upper end typically translates to around the maximum sequence length of 512 tokens). For the short texts in this set we achieve a latency of around 50 ms, or 20 inferences per second single threaded, on an Intel Xeon CPU @ 2.80GHz. Referring to the right axis, the speed-up for these short texts is a little over 100%. This is important because 200 characters is a long query, so we expect similar improvements in query latency. We achieved a little under 50% throughput improvement for the data set as a whole.
<cite> Speed up per thread from hybrid int8 dynamic quantisation of ELSER v2 using an Intel Xeon CPU </cite>Another avenue we explored was using the Intel Extension for PyTorch (IPEX). Currently, we recommend our users run Elasticsearch inference nodes on Intel hardware and it makes sense to optimize the models we deploy to make best use of it.
As part of this project we rebuilt our inference process to use the IPEX backend. A nice side effect of this was that ELSER inference with float32 is 18% faster in 8.11 and we see increased throughput advantage from hyperthreading. However, the primary motivation was the latest Intel cores have hardware support for bfloat16 format, which makes better performance accuracy tradeoffs for inference than float32. We wanted to understand how this performs. We saw around 3 times speedup using bfloat16, but only with the latest hardware support; so until this is well enough supported in the cloud environment the use of bfloat16 models is impractical. We instead turned our attention to other features of IPEX.
The IPEX library provides several optimizations which can be applied to float32 layers. This is handy because, as discussed, we retain around 20% of the model in float32 precision.
Transformers don't afford simple layer folding opportunities, so the principal optimization is blocking of linear layers. Multi-dimensional arrays are usually stored flat to optimize cache use. Furthermore, to get the most out of SIMD instructions one ideally loads memory from contiguous blocks into the wide registers which implement them. The operations performed on the model weights in inference alter their access patterns. For any given compute graph one can in theory work out the weight layout which maximizes performance. The optimal arrangement also depends on the instruction set available and the memory bandwidth; usually this amounts to reordering weights into blocks for specific tensor dimensions. Fortunately, the IPEX library has implemented the optimal strategy for Intel hardware for a variety of layers, including linear layers.
The figure below shows the effect of applying optimal block layout for float32 linear layers in ELSER v2. The performance was averaged over 5 runs. The effect is small however we verified it is statistically significant (p-value < 0.05). Also, it is consistently slightly larger for longer sequences, so for our representative collection of 1000 texts it translated to a little under 1% increase in throughput.
<cite> Speed up per thread from IPEX optimize on ELSER v2 using an Intel Xeon CPU </cite>Another interesting observation we made is that the performance improvements are larger when using intra-op parallelism. We consistently achieved 2-5% throughput improvement across a range of text lengths using both our VM's allotted physical cores.
In the end, we decided not to enable these optimisations. The performance gains we get from them are small and they significantly increase the model memory: our script file increased from 263MB to 505MB. However, IPEX and particularly hardware support for bfloat16 yield significant improvements for inference performance on CPU. This work got us a step closer to enabling this for Elasticsearch inference in the future.
In this post, we discussed how we were able to achieve between a 60% and 120% speed-up in inference compared to ELSER v1 by upgrading the libtorch backend in 8.11 and optimizing for x86 architecture. This is all while improving zero-shot relevance. Inference performance is the critical factor in the time to index a corpus. It is also an important part of query latency. At the same time, the index performance is equally important for query latency, particularly at large scale. We discuss this in part 2.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.
]]>It has been noted that retrieval can be slow when using scores computed from learned sparse representations, such as ELSER. Slow is a relative term and in this context we mean slow when compared to BM25 scored retrieval. There are two principal reasons for this:
The first bottleneck can be tackled at train time, albeit with a relevance retrieval cost tradeoff. There is a regularizer term in the training loss which allows one to penalize using more terms in the query expansion. There are also gains to be had by performing better model selection.
When training any model it is sensible to keep the best one as optimisation progresses. Typically the quality is measured using the training loss function evaluated on a hold-out, or validation, dataset. We had found this metric alone did not correlate as well as we liked with zero-shot relevance; so we were already measuring NDCG@10 on several small datasets from the BEIR suite to help decide which model to retain. This allows us to measure other aspects of retrieval behavior. In particular, we compute the retrieval cost using the number of weight multiplications performed on average to find the top-k matches for every query.
We found that there is quite significant variation between the retrieval cost for relatively small variation in retrieval quality and used this information to identify Pareto optimal models. This was done for various choices of our regularization hyperparameters at different points along their learning trajectories. The figure below shows a scatter plot of the candidate models we considered characterized by their relevance and cost, together with the choice we made for ELSER v2. In the end we sacrificed around 1% in relevance for around a 25% reduction in the retrieval cost.
<cite> Performing model selection for ELSER v2 via relevance retrieval cost multi-objective optimization </cite>Whilst this is a nice win, the figure also shows there is only so much it is possible to achieve when making the tradeoff at train time, at least without significantly impacting relevance. As we discussed before, with ELSER our goal is to train a model with excellent zero-shot relevance. Therefore, if we make the tradeoff during training we make it in a global setting, without knowing anything about the specific corpus where the model will be applied. To understand how to overcome the dichotomy between relevance and retrieval cost we need to study the token statistics in a specific corpus. At the same time, it is also useful to understand why BM25 scoring is so efficient for retrieval.
The BM25 score comprises two factors, one which relates to a term's frequency in each document and one which relates to its frequency in the corpus as a whole. Focusing our attention on the second factor, the score contribution of a term $t$ is weighted by its inverse document frequency (IDF) or $\log\left(\frac{1 - f_t}{f_t} + 1\right)$. Here $f_t=\frac{n_t+0.5}{N}$ and $n_t$ and $N$ denote the matching document count and total number of documents, respectively. So $f_t$ is just the proportion of the documents which contain that term, modulo a small correction which is negligible for large corpuses.
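Plugging some illustrative counts into the IDF formula above makes its behavior concrete:

```python
# Sketch of the IDF weighting described above: f_t is the (corrected)
# fraction of documents containing term t, and IDF decreases as the term
# becomes more frequent in the corpus.
import math

def idf(n_t, N):
    f_t = (n_t + 0.5) / N
    return math.log((1 - f_t) / f_t + 1)

N = 1_000_000  # illustrative corpus size
for n_t in [10, 1_000, 100_000]:
    print(f"term in {n_t} docs -> IDF {idf(n_t, N):.3f}")
```

A term appearing in 10% of documents contributes far less than a rare one, which is precisely what lets block-max WAND skip blocks dominated by frequent terms.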
It is clear that IDF is a monotonic decreasing function of the frequency. Coupled with block-max WAND, this allows retrieval to skip many non-competitive documents even if the query includes frequent terms. Specifically, in any given block one might expect some documents to contain frequent terms, but with BM25 scoring they are unlikely to be competitive with the best matches for the query.
The figure below shows statistics related to the top tokens generated by ELSER v2 for the NFCorpus dataset. This is one of the datasets used to evaluate retrieval in the BEIR suite and comprises queries and documents related to nutrition. The token frequencies, expressed as a percentage of the documents which contain that token, are on the right hand axis and the corresponding IDF and the average ELSER v2 weight for the tokens are on the left hand axis. If one examines the top tokens they're what we might expect given the corpus content: things like “supplement”, “nutritional”, “diet”, etc. Queries expand to a similar set of terms. This underlines that even if tokens are well distributed in the training corpus as a whole, they can end up concentrated when we examine a specific corpus. Furthermore, we see that unlike BM25 the weight is largely independent of token frequency and this makes block-max WAND ineffective. The outcome is retrieval is significantly more expensive than BM25.
<cite> Average ELSER v2 weights and IDF for the top 500 tokens in the document expansions of NFCorpus together with the percentage of documents in which they appear </cite>Taking a step back, this suggests we reconsider token importance in light of the corpus subject matter. In a general setting, tokens related to nutrition may be highly informative. However, for a corpus about nutrition they are less so. This in fact is the underpinning of information theoretic approaches to retrieval. Roughly speaking we have two measures of the token information content for a specific query and corpus: its assigned weight - which is the natural analogue of the term frequency term used in BM25 - and the token frequency in the corpus as a whole - which we disregard when we score matches using the product of token weights. This suggests the following simple strategy to accelerate queries with hopefully little impact on retrieval quality:
We can calculate the expected fraction of documents a token will be present in, assuming they all occur with equal probability. This is just the ratio $\frac{N_T}{N|T|}$ where $N_T$ is the total number of tokens in the corpus, $N$ is the number of documents in the corpus and $|T|$ is the vocabulary size, which is 30522. Any token that occurs in a significantly greater fraction of documents than this is frequent for the corpus.
We found that pruning tokens which are 5 times more frequent than expected was an effective relevance retrieval cost tradeoff. We fixed the count of documents reranked using the full token set to 5 times the required set, so 50 for NDCG@10. We found we achieved more consistent results setting the weight threshold for which to retain tokens as a fraction of the maximum weight of any token in the query expansion. For the results below we retain all tokens whose weight is greater than or equal to 0.4 × “max token weight for the query”. This threshold was chosen so NDCG@10 was unchanged on NFCorpus. However, the same parameterization worked for the 13 other datasets we tested, which strongly suggests that it generalizes well.
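The pruning heuristic above can be sketched as follows. The token weights and document frequencies here are illustrative, not NFCorpus statistics:

```python
# Sketch of the pruning heuristic: a token is "frequent" if it appears in
# more than 5x the expected fraction of documents, and a frequent token is
# only kept in the first-phase query if its weight is at least 0.4x the
# maximum weight in the query expansion.

VOCAB_SIZE = 30522

def prune(expansion, doc_freq, n_docs, total_tokens,
          freq_factor=5.0, weight_ratio=0.4):
    """Split a query expansion {token: weight} into kept and dropped sets."""
    expected_fraction = total_tokens / (n_docs * VOCAB_SIZE)
    max_weight = max(expansion.values())
    kept, dropped = {}, {}
    for token, weight in expansion.items():
        frequent = doc_freq.get(token, 0) / n_docs > freq_factor * expected_fraction
        if frequent and weight < weight_ratio * max_weight:
            dropped[token] = weight   # scored only in the rescore phase
        else:
            kept[token] = weight      # scored in the first retrieval phase
    return kept, dropped

expansion = {"diet": 2.0, "nutritional": 1.5, "zinc": 0.5, "the": 0.1}
doc_freq = {"diet": 900, "nutritional": 800, "zinc": 30, "the": 1000}
kept, dropped = prune(expansion, doc_freq, n_docs=1000, total_tokens=2_000_000)
print("kept:", sorted(kept), "dropped:", sorted(dropped))
```

Rare tokens and high-weight frequent tokens survive the first phase; low-weight frequent tokens are deferred to reranking.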
The table below shows the change in NDCG@10 relative to ELSER v2 with exact retrieval together with the retrieval cost relative to ELSER v1 with exact retrieval using this strategy. Note that the same pruning strategy can be applied to any learned sparse representation. However, we believe the key questions to answer are:
In summary, we achieved a very small improvement(!) of 0.07% in average NDCG@10 when we used the optimized query compared to the exact query and an average 3.4 times speedup. Furthermore, this speedup is measured without block-max WAND. As we expected, the optimization works particularly well together with block-max WAND. On a larger corpus (8.8M passages) we saw an 8.4 times speedup with block-max WAND enabled.
<cite> Measuring the relevance and latency impact of using token pruning followed by reranking. The relevance is measured by percentage change in NDCG@10 for exact retrieval with ELSER v2 and the speedup is measured with respect to exact retrieval with ELSER v1 </cite>An intriguing aspect of these results is that on average we see a small relevance improvement. Together with the fact that we previously showed carefully tuned combinations of ELSER v1 and BM25 scores yield very significant relevance improvements, it strongly suggests there are benefits available for relevance as well as for retrieval cost by making better use of corpus token statistics. Ideally, one would re-architect the model and train the query expansion to make use of both token weights and their frequencies. This is something we are actively researching.
As of Elasticsearch 8.13.0, we have integrated this optimization into the text_expansion query via token pruning, so it is automatically applied in the retrieval phase.
For versions of Elasticsearch before 8.13.0, it is possible to achieve the same results using existing Elasticsearch query DSL given an analysis of the token frequencies and their weights.
Tokens are stored in the _source field so it is possible to paginate through the documents and accumulate token frequencies to find out which tokens to exclude. Given an inference response one can partition the tokens into a “kept” and “dropped” set. The kept set is used to score the match in a should query. The dropped set is used in a rescore query on a window of the top 50 docs. Using query_weight and rescore_query_weight both equal to one simply sums the two scores so recovers the score using the full set of tokens. The query together with some explanation is shown below.
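A minimal sketch of such a two-phase request follows, built here as a Python dict. The field name "ml.tokens", the token weights and the window size are illustrative assumptions, not the exact query from our experiments:

```python
# Sketch of the pre-8.13 two-phase query described above: kept tokens
# score the first phase via a bool/should query; dropped tokens are added
# back via a rescore query over the top-ranked window.

def pruned_query(kept, dropped, rescore_window=50):
    def term(tok, w):
        # Field name "ml.tokens" is an illustrative assumption.
        return {"term": {"ml.tokens": {"value": tok, "boost": w}}}

    return {
        "query": {"bool": {"should": [term(t, w) for t, w in kept.items()]}},
        "rescore": {
            "window_size": rescore_window,
            "query": {
                "rescore_query": {
                    "bool": {"should": [term(t, w) for t, w in dropped.items()]}
                },
                # Equal weights simply sum the two scores, recovering the
                # score computed over the full token set.
                "query_weight": 1.0,
                "rescore_query_weight": 1.0,
            },
        },
    }

body = pruned_query({"zinc": 1.2, "diet": 0.9}, {"the": 0.1})
print(body["rescore"]["window_size"])
```

The dict can be passed directly as the request body of a `_search` call.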
In these last two posts in our series we introduced the second version of the Elastic Learned Sparse EncodeR. So what benefits does it bring?
With some improvements to our training data set and regularizer we were able to obtain roughly a 2% improvement on our benchmark of zero-shot relevance. At the same time we've also made significant improvements to inference performance and retrieval latency.
We traded a small degradation (of a little less than 1%) in relevance for a large improvement (of over 25%) in the retrieval latency when performing model selection in the training loop. We also identified a simple token pruning strategy and verified it had no impact on retrieval quality. Together these sped up retrieval by between 2 and 5 times when compared to ELSER v1 on our benchmark suite. Token pruning can currently be implemented using Elasticsearch DSL, but we're also working towards performing it automatically in the text_expansion query.
To improve inference performance we prepared a quantized version of the model for x86 architecture and upgraded the libtorch backend we use. We found that these sped up inference by between 1.7 and 2.2 times depending on the text length. By using hybrid dynamic quantisation, based on an analysis of layer sensitivity to quantisation, we were able to achieve this with minimal loss in relevance.
We believe that ELSER v2 represents a step change in performance, so encourage you to give it a try!
This is an exciting time for information retrieval, which is being reshaped by rapid advances in NLP. We hope you've enjoyed this blog series in which we've tried to give a flavor of some of this field. This is not the end, rather the end of the beginning for us. We're already working on various improvements to retrieval in Elasticsearch and particularly in end-to-end optimisation of retrieval and generation pipelines. So stay tuned!
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.
]]>We also discuss experiments we undertook to explore some general research questions. These include how best to parameterize Reciprocal Rank Fusion and how to calibrate Weighted Sum of Scores.
Despite modern training pipelines producing retriever models with good performance in zero-shot scenarios, it is known that lexical retrievers (such as BM25) and semantic retrievers (like Elastic Learned Sparse Encoder) are somewhat complementary. Specifically, combining the results of the two retrieval methods will improve relevance, provided that more matches occur between the relevant documents they retrieve than between the irrelevant documents they retrieve.
This hypothesis is plausible for methods using very different mechanisms for retrieval because there are many more irrelevant than relevant documents for most queries and corpuses. If methods retrieve relevant and irrelevant documents independently and uniformly at random, this imbalance means it is much more probable for relevant documents to match than irrelevant ones. We performed some overlap measurements to check this hypothesis between Elastic Learned Sparse Encoder, BM25, and various dense retrievers as shown in Table 1. This provides some rationale for using so-called hybrid search. In the following, we investigate two explicit implementations of hybrid search.
Reciprocal Rank Fusion was proposed in this paper. It is easy to use, being fully unsupervised and not even requiring score calibration. It works by ranking a document $d$ with both BM25 and a model, and calculating its score based on the ranking positions for both methods. Documents are sorted by descending score. The score is defined as $\text{score}(d) = \sum_{m \in \text{methods}} \frac{1}{k + \text{rank}_m(d)}$.
The method uses a constant k to adjust the importance of lowly ranked documents. It is applied to the top N document set retrieved by each method. If a document is missing from this set for either method, that term is set to zero.
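The method above can be sketched in a few lines of Python; the ranked lists here are illustrative:

```python
# Sketch of Reciprocal Rank Fusion over ranked lists from each method:
# each document scores sum of 1 / (k + rank) over the methods that
# retrieved it, and a missing document contributes zero for that method.

def rrf(rankings, k=60, top_n=1000):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking[:top_n], start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3", "d4"]
semantic_ranking = ["d3", "d1", "d5"]
print(rrf([bm25_ranking, semantic_ranking], k=60))
```

Note that d1 and d3, retrieved highly by both methods, outrank documents that appear in only one list, which is the behavior the overlap hypothesis relies on.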
The paper that introduces Reciprocal Rank Fusion suggests a value of 60 for k and doesn’t discuss how many documents N to retrieve. Clearly, increasing N can improve ranking quality as long as recall@N is still increasing for either method. Qualitatively, the larger k the more important lowly ranked documents are to the final order. However, it is not a priori clear what would be optimal values of k and N for modern lexical semantic hybrid retrieval. Furthermore, we wanted to understand how sensitive the results are to the choice of these parameters and whether the optimum generalizes between data sets and models. This is important to have confidence in the method in a zero-shot setting.
To explore these questions, we performed a grid search to maximize the weighted average NDCG@10 for a subset of the BEIR benchmark for a variety of models. We used Elasticsearch for retrieval in this experiment representing each document by a single text field and vector. The BM25 search was performed using a match query and dense retrieval using exact vector search with a script_score query.
Referring to Table 2, we see that for roberta-base-ance-firstp optimal values for k and N are 20 and 1000, respectively. We emphasize that for the majority of individual data sets, the same combination of parameters was optimal. We did the same grid search for distilbert-base-v3 and minilm-l12-v3 with the same conclusion for each model. It is also worth noting that the difference between the best and worst parameter combinations is only about 5%; so the penalty for mis-setting these parameters is relatively small.
We also wanted to see if we could improve the performance of Elastic Learned Sparse Encoder in a zero-shot setting using Reciprocal Rank Fusion. The results on the BEIR benchmark are given in Table 3.
Reciprocal Rank Fusion increases average NDCG@10 by 1.4% over Elastic Learned Sparse Encoder alone and 18% over BM25 alone. Also, importantly the result is either better or similar to BM25 alone for all test data sets. The improved ranking is achieved without the need for model tuning, training data sets, or specific calibration. The only drawback is that currently the query latency is increased as the two queries are performed sequentially in Elasticsearch. This is mitigated by the fact that BM25 retrieval is typically faster than semantic retrieval.
Our findings suggest that Reciprocal Rank Fusion can be safely used as an effective “plug and play” strategy. Furthermore, it is worth reviewing the quality of results one obtains with BM25, Elastic Learned Sparse Encoder and their rank fusion on your own data. If one were to select the best performing approach on each individual data set in the BEIR suite, the increase in average NDCG@10 is, respectively, 3% and 20% over Elastic Learned Sparse Encoder and BM25 alone.
As part of this work, we also performed some simple query classification to distinguish keyword and natural question searches. This was to try to understand the mechanisms that lead to a given method performing best. So far, we don’t have a clear explanation for this and plan to explore this further. However, we did find that hybrid search performs strongly when both methods have similar overall accuracy.
Finally, Reciprocal Rank Fusion can be used with more than two methods or could be used to combine rankings from different fields. So far, we haven’t explored this direction.
Another way to do hybrid retrieval supported by Elasticsearch is to combine BM25 scores and model scores using a linear function. This approach was studied in this paper, which showed it to be more effective than Reciprocal Rank Fusion when well calibrated. We explored hybrid search via a convex linear combination of scores defined as follows:
score(q, d) = α · score_model(q, d) + (1 − α) · score_BM25(q, d)

where α is the model score weight and is between 0 and 1.
Ideal calibration of the linear combination is not straightforward, as it requires annotations similar to those used for fine-tuning a model. Given a set of queries and associated relevant documents, we can use any optimization method to find the optimal combination for retrieving those documents. In our experiments, we used BEIR data sets and Bayesian optimization to find the optimal combination, optimizing for NDCG@10. In theory, the ratio of score scales can be incorporated into the value learned for α. However, in the following experiments, we normalized BM25 scores and Elastic Learned Sparse Encoder scores per data set using min-max normalization, calculating the minimum and maximum from the top 1,000 scores for some representative queries on each data set. The hope was that, with normalized scores, the optimal value of α would transfer between data sets. We didn’t find evidence for this, but the optimal value is much more consistent after normalization, and so normalization does likely improve the robustness of the calibration.
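The normalization and combination steps can be sketched as follows. Function names and score bounds are illustrative; in our experiments the minimum and maximum were estimated from the top 1,000 scores of representative queries per data set:

```python
def min_max_normalize(score, lo, hi):
    """Map a raw score into [0, 1] using estimated score bounds."""
    return (score - lo) / (hi - lo) if hi > lo else 0.0

def hybrid_score(bm25_score, model_score, alpha, bm25_bounds, model_bounds):
    """Convex combination of normalized BM25 and model scores,
    weighted by alpha in [0, 1]."""
    nb = min_max_normalize(bm25_score, *bm25_bounds)
    nm = min_max_normalize(model_score, *model_bounds)
    return alpha * nm + (1.0 - alpha) * nb
```

Calibration then amounts to searching over α (by Bayesian optimization, grid search, or similar) for the value that maximizes NDCG@10 on the annotated queries.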
Obtaining annotations is expensive, so it is useful to know how much data to gather to be confident of beating Reciprocal Rank Fusion (RRF). Figure 1 shows the NDCG@10 for a linear combination of BM25 and Elastic Learned Sparse Encoder scores as a function of the number of annotated queries for the ArguAna data set. For reference, the BM25, Elastic Learned Sparse Encoder and RRF NDCG@10 are also shown. This sort of curve is typical across data sets. In our experiments, we found that it was possible to outperform RRF with approximately 40 annotated queries, although the exact threshold varied slightly from one data set to another.
We also observed that the optimal weight varies significantly both across different data sets (see Figure 2) and also for different retrieval models. This is the case even after normalizing scores. One might expect this because the optimal combination will depend on how well the individual methods perform on a given data set.
To explore the possibility of a zero-shot parameterisation, we experimented with choosing a single weight α for all data sets in our benchmark set. Although we used the same supervised approach to do this, this time choosing the weight to optimize average NDCG@10 for the full suite of data sets, we feel that there is enough variation between data sets that our findings may be representative of zero-shot performance.
In summary, this approach yields better average NDCG@10 than RRF. However, we also found the results were less consistent than RRF and we stress that the optimal weight is model specific. For this reason, we feel less confident the approach transfers to new settings even when calibrated for a specific model. In our view, linear combination is not a “plug and play” approach. Instead, we believe it is important to carefully evaluate the performance of the combination on your own data set to determine the optimal settings. However, as we will see below, if it is well calibrated it yields very good results.
Normalization is essential for comparing scores between different data sets and models, as scores can vary a lot without it. It is not always easy to do, especially for Okapi BM25, where the range of scores is unknown until queries are made. Dense model scores are easier to normalize, as their vectors can be normalized. However, it is worth noting that some dense models are trained without normalization and may perform better with dot products.
Elastic Learned Sparse Encoder is trained to replicate cross-encoder score margins. We typically see it produce scores in the range 0 to 20, although this is not guaranteed. In general, a history of queries and their top N document scores can be used to approximate the score distribution and normalize any scoring function with minimum and maximum estimates. We note that non-linear normalization could lead to an improved linear combination, for example if there are score outliers, although we didn’t test this.
As we did for Reciprocal Rank Fusion, we wanted to understand the accuracy of a linear combination of BM25 and Elastic Learned Sparse Encoder — this time, though, in the best possible scenario. In this scenario, we optimize one weight α per data set to obtain the ideal NDCG@10 using linear combination. We used 300 queries to calibrate — we found this was sufficient to estimate the optimal weight for all data sets. In production, this scenario is realistically difficult to achieve because it needs both accurate min-max normalization and a representative annotated data set to adjust the weight. This would also need to be refreshed if the documents and queries drift significantly. Nonetheless, bounding the best case performance is still useful to have a sense of whether the effort might be worthwhile. The results are displayed in Table 4. This approach gives a 6% improvement in average NDCG@10 over Elastic Learned Sparse Encoder alone and a 24% improvement over BM25 alone.
We showed it is possible to combine different retrieval approaches to improve their performance and in particular lexical and semantic retrieval complement one another. One approach we explored was Reciprocal Rank Fusion. This is a simple method that often yields good results without requiring any annotations nor prior knowledge of the score distribution. Furthermore, we found its performance characteristics were remarkably stable across models and data sets, so we feel confident that the results we observed will generalize to other data sets.
Another approach is Weighted Sum of Scores, which is more difficult to set up, but in our experiments yielded very good ranking with the right setup. To use this approach, scores should be normalized, which for BM25 requires score distributions for typical queries, furthermore some annotated data should be used for training the method weights.
In our final planned blog in this series, we will introduce the work we have been doing around inference and index performance as we move toward GA for the text_expansion feature.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
]]>Historically, comparisons between BM25 and learned retrieval models have been based on limited data sets, or even only on the training data set of these dense models, MSMARCO, which may not provide an accurate representation of the models' performance on your data. While this approach is useful for demonstrating how well a dense model performs against BM25 in a specific domain, it does not capture one of BM25's key strengths: its ability to perform well in many domains without the need for supervised fine-tuning. It may therefore be considered unfair to compare these two methods using such a specific data set.
The BEIR paper ("BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models," by Thakur et al. 2021) sets out to address the issue of evaluating information retrieval methods in a generic setting. The paper proposes a framework using 18 publicly available data sets from a diverse range of topics to benchmark state-of-the-art retrieval systems.
In this post, we use a subcollection of those data sets to benchmark BM25 against two dense models that have been specifically trained for retrieval. Then we will illustrate the potential gain achievable using fine-tuning strategies with one of those dense models. We plan to return to this benchmark in our next blog post, since it forms the basis of the testing we have done to enhance Elasticsearch relevance using language models in a zero-shot setting.
Performance can vary greatly between retrieval methods, depending on the type of query, document size, or topic. To assess the diversity of the data sets and to identify potential blind spots in our benchmarks, we used a classification algorithm trained to recognize natural questions to understand the typology of the queries. The results are summarized in Table 1.
In our benchmarks, we choose not to include MSMARCO to solely emphasize performance in unfamiliar settings. Evaluating a model in a setting that is different from its training data is valuable when the nature of your use case data is unknown or resource constraints prevent adapting the model specifically.
Selecting the appropriate metric is crucial in evaluating a model's ranking ability accurately. Of the various metrics available, three are commonly utilized for search relevance:
All of these metrics are applied to a fixed-sized list of retrieved documents. The list size can vary depending on the task at hand. For example, a preliminary retrieval before a reranking task might consider the top 1000 retrieved documents, while a single-stage retrieval might use a smaller list size to mimic a user's search engine behavior. We have chosen to fix the list size to the top 10 documents, which aligns with our use case.
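For reference, NDCG@k for a single query can be sketched as follows, using the exponential gain variant common in TREC-style evaluation (graded relevance judgments are supplied in the order the system returned the documents):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: relevances are the graded judgments of the
    retrieved documents, in retrieved-rank order."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))  # best achievable ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

The log-discount means relevance in the first few positions dominates the metric, which matches the intuition that users rarely look past the top results.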
In our previous blog post, we noted that dense models, due to their training design, are optimized for the data sets they are trained on. While they have been shown to perform well in-domain, in this section we explore whether they maintain their performance when used out-of-domain. To do this, we compare the performance of two state-of-the-art dense retrievers (msmarco-distilbert-base-tas-b and msmarco-roberta-base-ance-firstp) with BM25 in Elasticsearch using the default settings and the English analyzer.
Those two dense models both outperform BM25 on MSMARCO (as seen in the BEIR paper), as they are trained specifically on this data set. However, they are usually worse out-of-domain. In other words, if a model is not well adapted to your specific data, it’s very likely that using kNN and dense models would degrade your retrieval performance in comparison to BM25.
The portrayal of dense models in the previous description isn't the full picture. Their performance can be improved by fine-tuning them for a specific use case with some labeled data that represents that use case. If you have a fine-tuned embedding model, the Elastic Stack is a great platform to both run the inference for you and retrieve similar documents using ANN search.
There are various methods for fine-tuning a dense model, some of which are highly sophisticated. However, this blog post won't delve into those methods as it's not the focus. Instead, two methods were tested to gauge the potential improvement that can be achieved with not a lot of domain specific training data. The first method (FineTuned A) involved using labeled positive documents and randomly selecting documents from the corpus as negatives. The second method (FineTuned B) involved using labeled positive documents and using BM25 to identify documents that are similar to the query from BM25's perspective, but aren't labeled as positive. These are referred to as "hard negatives."
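The hard-negative mining used by the second method can be sketched as follows. The `bm25_search` callable is a stand-in for an actual BM25 retrieval over the corpus, and the function name is illustrative:

```python
def mine_hard_negatives(query, positive_ids, bm25_search, top_k=100):
    """Hard negatives: documents BM25 ranks highly for the query but that
    are not labeled as positives for it."""
    positives = set(positive_ids)
    return [doc_id for doc_id in bm25_search(query, top_k)
            if doc_id not in positives]
```

Training against these lexically similar but unlabeled documents forces the model to learn distinctions BM25 cannot make, which is why hard negatives typically outperform random negatives.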
Labeling data is probably the most challenging aspect of fine-tuning. Depending on the subject and field, manually tagging positive documents can be expensive and complex. Incomplete labeling can also create problems for hard-negative mining, causing adverse effects on fine-tuning. Finally, changes to the topic or semantic structure in a database over time will reduce retrieval accuracy for fine-tuned models.
We have established a foundation for information retrieval using 13 data sets. The BM25 model performs well in a zero-shot setting and even the most advanced dense models struggle to compete on every data set. These initial benchmarks indicate that current SOTA dense retrieval cannot be used effectively without proper in-domain training. The process of adapting the model requires labeling work, which may not be feasible for users with limited resources.
In our next blog, we will discuss alternative approaches for efficient retrieval systems that do not require the creation of a labeled data set. These solutions will be based on hybrid retrieval methods.
]]>
Given all these components and their parameters, and depending on the text corpus you want to search in, it can be overwhelming to choose which settings will give the best search relevance.
In this series of blog posts, we will introduce a number of tests we ran using various publicly available data sets and information retrieval techniques that are available in the Elastic Stack. We’ll then provide recommendations of the best techniques to use depending on the setup.
To kick off this series of blogs, we want to set the stage by describing the problem we are addressing and describe some methods we will dig further into in subsequent blogs.
The classic way documents are ranked for relevance by Elasticsearch according to a text query uses the Lucene implementation of the Okapi BM25 model. Although a few hyperparameters of this model were fine-tuned to optimize the results in most scenarios, this technique is considered unsupervised as labeled queries and documents are not required to use it: it’s very likely that the model will perform reasonably well on any corpus of text, without relying on annotated data. BM25 is known to be a strong baseline in zero-shot retrieval settings.
Under the hood, this kind of model builds a matrix of term frequencies (how many times a term appears in each document) and inverse document frequencies (inverse of how many documents contain each term). It then scores each query term for each document that was indexed based on those frequencies. Because each document typically contains a small fraction of all words used in the corpus, the matrix contains a lot of zeros. This is why this type of representation is called sparse.
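As a rough illustration of how these frequencies turn into a score, here is a toy Okapi BM25 scorer over a tokenized corpus. This is a simplified sketch; Lucene's implementation differs in details such as IDF smoothing and cached length normalization:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Toy BM25: corpus is a list of tokenized documents (lists of terms)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]  # term frequency in this document
        norm = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score
```

Note that a document sharing no terms with the query scores exactly zero, which is the vocabulary mismatch problem discussed next.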
Also, this model sums the relevance score of each individual term within a query for a document, without taking into account any semantic knowledge (synonyms, context, etc.). This is called lexical search (as opposed to semantic search). Its shortcoming is the so-called vocabulary mismatch problem: the query vocabulary is often slightly different from the document vocabulary. This motivates other scoring models that try to incorporate semantic knowledge to avoid this problem.
More recently, transformer-based models have allowed for a dense, context aware representation of text, addressing the principal shortcomings mentioned above.
To build such models, the following steps are required:
1. Pre-training
We first need to train a neural network to understand the basic syntax of natural language.
Using a huge corpus of text, the model learns semantic knowledge by training on unsupervised tasks (like Masked Word Prediction or Next Sentence Prediction).
BERT is probably the best known example of these models — it was trained on Wikipedia (2.5B words) and BookCorpus (800M words) using Masked Word Prediction.
This is called pre-training. The model learns vector representations of language tokens, which can be adapted for other tasks with much less training.
Note that at this step, the model wouldn’t perform well on downstream NLP tasks.
This step is very expensive, but many such foundational models exist that can be used off the shelf.
2. Task-specific training
Now that the model has built a representation of natural language, it’ll train much more effectively on a specific task such as Dense Passage Retrieval (DPR) that allows Question Answering.
To do so, we must slightly adapt the model’s architecture and then train it on a large number of instances of the task, which, for DPR, consists of matching a query to a relevant passage taken from a document.
So this requires a labeled data set, that is, a collection of triplets: a query, a passage relevant to that query, and an irrelevant passage.
A very popular and publicly available data set to perform such a training for DPR is the MS MARCO data set.
This data set was created using queries and top results from Microsoft’s Bing search engine. As such, the queries and documents it contains fall in the general knowledge linguistic domain, as opposed to specific linguistic domain (think about research papers or language used in law).
This notion of linguistic domain is important, as the semantic knowledge learned by those models is giving them an important advantage “in-domain”: when BERT came out, it improved previous state of the art models on this MS MARCO data set by a huge margin.
3. Domain-specific training
Depending on how different your data is from the data set used for task-specific training, you might need to train your model using a domain specific labeled data set. This step is also referred to as fine tuning for domain adaptation or domain-adaptation.
The good news is that you don’t need as large a data set as was required for the previous steps — a few thousand or tens of thousands of instances of the task can be enough.
The bad news is that these query-document pairs need to be built by domain experts, so it’s usually a costly option.
The domain adaptation is roughly similar to the task-specific training.
Having introduced these various techniques, we will measure how they perform on a wide variety of data sets. This sort of general purpose information retrieval task is of particular interest for us. We want to provide tools and guidance for a range of users, including those who don’t want to train models themselves in order to gain some of the benefits they bring to search. In the next blog post of this series, we will describe the methodology and benchmark suite we will be using.
]]>
In our previous blog post in this series, we discussed some of the challenges applying dense models to retrieval in a zero-shot setting. This is well known and was highlighted by the BEIR benchmark, which assembled diverse retrieval tasks as a proxy to the performance one might expect from a model applied to an unseen data set. Good retrieval in a zero-shot setting is exactly what we want to achieve, namely a one-click experience that enables textual fields to be searched using a pre-trained model.
This new capability fits into the Elasticsearch _search endpoint as just another query clause, a text_expansion query. This is attractive because it allows search engineers to continue to tune queries with all the tools Elasticsearch already provides. Furthermore, to truly achieve a one-click experience, we've integrated it with the new Elasticsearch Relevance Engine. However, rather than focus on the integration, this blog digs a little into ELSER's model architecture and the work we did to train it.
We had another goal at the outset of this project. The natural language processing (NLP) field is fast moving, and new architectures and training methodologies are being introduced rapidly. While some of our users will keep on top of the latest developments and want full control over the models they deploy, others simply want to consume a high quality search product. By developing our own training pipeline, we have a playground for implementing and evaluating the latest ideas, such as new retrieval relevant pre-training tasks or more effective distillation tasks, and making the best ones available to our users.
Finally, it is worth mentioning that we view this feature as complementary to the existing model deployment and vector search capabilities in the Elastic Stack, which are needed for those more custom use cases like cross-modal retrieval.
Before looking at some of the details of the architecture and how we trained our model, the Elastic Learned Sparse Encoder (ELSER), it's interesting to review the results we get, as ultimately the proof of the pudding is in the eating.
As we discussed before, we use a subset of BEIR to evaluate our performance. While this is by no means perfect, and won't necessarily represent how the model behaves on your own data, we at least found it challenging to make significant improvements on this benchmark. So we feel confident that improvements we get on this translate to real improvements in the model. Since absolute performance numbers on benchmarks by themselves aren't particularly informative, it is nice to be able to compare with other strong baselines, which we do below.
The table below shows the performance of Elastic Learned Sparse Encoder compared to Elasticsearch's BM25 with an English analyzer broken down by the 12 data sets we evaluated. We have 10 wins, 1 draw, and 1 loss and an average improvement in NDCG@10 of 17%.
<cite> NDCG@10 for BEIR data sets for BM25 and Elastic Learned Sparse Encoder (referred to as “ELSER” above, note higher is better) </cite>In the following table, we compare our average performance to some other strong baselines. The Vespa results are based on a linear combination of BM25 and their implementation of ColBERT as reported here, the Instructor results are from this paper, the SPLADEv2 results are taken from this paper and the OpenAI results are reported here. Note that we've separated out the OpenAI results because they use a different subset of the BEIR suite. Specifically, they average over ArguAna, Climate FEVER, DBPedia, FEVER, FiQA, HotpotQA, NFCorpus, QuoraRetrieval, SciFact, TREC COVID and Touche. If you follow that link, you will notice they also report NDCG@10 expressed as a percentage. We refer the reader to the links above for more information on these approaches.
<cite> Average NDCG@10 for BEIR data sets vs. various high quality baselines (higher is better). Note: OpenAI chose a different subset, and we report our results on this set separately. </cite>Finally, we note it has been widely observed that an ensemble of statistical (a la BM25) and model based retrieval, or hybrid search, tends to outperform either in a zero-shot setting. Already in 8.8, Elastic allows one to do this for text_expansion with linear boosting and this works well if you calibrate to your data set. We are also working on Reciprocal Rank Fusion (RRF), which performs well without calibration. Stay tuned for our next blog in this series, which will discuss hybrid search.
Having seen how ELSER performs, we next discuss its architecture and some aspects of how it is trained.
We showed in our previous blog post that, while very effective if fine-tuned, dense retrieval tends not to perform well in a zero-shot setting. By contrast cross-encoder architectures, which don't scale well for retrieval, tend to learn robust query and document representations and work well on most text. It has been suggested that part of the reason for this difference is the bottleneck of the query and document interacting only via a relatively low dimensional vector “dot product.” Based on this observation, a couple of model architectures have been recently proposed that try to reduce this bottleneck — these are ColBERT and SPLADE.
From our perspective, SPLADE has some additional advantages:
One last clear advantage compared to dense retrieval is that SPLADE allows one a simple and compute efficient route to highlight words generating a match. This simplifies surfacing relevant passages in long documents and helps users better understand how the retriever is working. Taken together, we felt that these provided a compelling case for adopting the SPLADE architecture for our initial release of this feature.
There are multiple good detailed descriptions of this architecture — if you are interested in diving in, this, for example, is a nice write up by the team that created the model. In very brief outline, the idea is, rather than using a distributed representation (say, averaging BERT token output embeddings), to use the token logits, that is, the log-odds with which tokens are predicted, for masked word prediction.
When language models are used to predict masked words, they achieve this by predicting a probability distribution over the tokens of their vocabulary. The BERT vocabulary, for WordPiece, contains many common real words such as cat, house, and so on. It also contains common word endings — things like ##ing (with the ## simply denoting it is a continuation). Since words can't be arbitrarily exchanged, relatively few tokens will be predicted for any given mask position. SPLADE takes as a starting point for its representation of a piece of text the tokens most strongly predicted by masking each word of that text. As noted, this is a naturally disentangled or sparse representation of that text.
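In sketch form, building the SPLADE representation looks like the following. As in the SPLADE papers, per-token weights come from log(1 + ReLU(logit)) max-pooled over input positions; the dict-of-logits input here is a simplification of the model's output tensor:

```python
import math

def splade_weights(position_logits):
    """position_logits: for each input position, a mapping of vocabulary
    token -> MLM logit. Returns the sparse token-weight vector for the text."""
    weights = {}
    for logits in position_logits:
        for token, logit in logits.items():
            w = math.log1p(max(logit, 0.0))  # log(1 + ReLU(logit))
            if w > weights.get(token, 0.0):  # max pooling over positions
                weights[token] = w
    return {t: w for t, w in weights.items() if w > 0.0}
```

Only tokens with positive logits survive, which is what makes the representation sparse; relevance between a query and a document is then the dot product of their two weight vectors.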
It is reasonable to think of these token probabilities for word prediction as roughly capturing contextual synonyms. This has led people to view learned sparse representations, such as SPLADE, as something close to automatic synonym expansion of text, and we see this in multiple online explanations of the model.
In our view, this is at best an oversimplification and at worst misleading. SPLADE takes as the starting point for fine-tuning the maximum token logits for a piece of text, but it then trains on a relevance prediction task, which crucially accounts for the interaction between all shared tokens in a query and document. This process somewhat re-entangles the tokens, which start to behave more like components of a vector representation (albeit in a very high dimensional vector space).
We explored this a little as we worked on this project. We saw as we tried removing low score and apparently unrelated tokens in the expansion post hoc that it reduced all quality metrics, including precision(!), in our benchmark suite. This would be explained if they were behaving more like a distributed vector representation, where zeroing individual components is clearly nonsensical. We also observed that we can simply remove large parts of BERT's vocabulary at random and still train highly effective models as the figure below illustrates. In this context, parts of the vocabulary must be being repurposed to account for the missing words.
<cite> Margin MSE validation loss for student models with different vocabulary sizes </cite>Finally, we note that unlike say generative tasks where size really does matter a great deal, retrieval doesn't as clearly benefit from having huge models. We saw in the result section that this approach is able to achieve near state-of-the-art performance with only 100M parameters, as compared to hundreds of millions or even billions of parameters in some of the larger generative models. Typical search applications have fairly stringent requirements on query latency and throughput, so this is a real advantage.
In our first blog, we introduced some of the ideas around training dense retrieval models. In practice, this is a multi stage process and one typically picks up a model that has already been pre-trained. This pre-training task can be rather important for achieving the best possible results on specific downstream tasks. We don't discuss this further because to date this hasn't been our focus, but note in passing that like many current effective retrieval models, we start from a co-condenser pre-trained model.
There are many potential avenues to explore when designing training pipelines. We explored quite a few, and suffice to say, we found making consistent improvements on our benchmark was challenging. Multiple ideas that looked promising on paper didn't provide compelling improvements. To avoid this blog becoming too long, we first give a quick overview of the key ingredients of the training task and focus on one novelty we introduced, which provided the most significant improvements. Independent of specific ingredients, we also made some qualitative and quantitative observations regarding the role of the FLOPS regularization, which we will discuss at the end.
When training models for retrieval, there are two common paradigms: contrastive approaches and distillation approaches. We adopted the distillation approach because this was shown to be very effective for training SPLADE in this paper. The distillation approach is slightly different from the paradigm that informs the name: shrinking a large model to a small, but almost as accurate, “copy.” Instead, the idea is to distill the ranking information present in a cross-encoder architecture. This poses a small technical challenge: since the representation is different, it isn't immediately clear how one should mimic the behavior of the cross-encoder with the model being trained. The standard idea we used is to present both models with triplets of the form (query, relevant document, irrelevant document). The teacher model computes a score margin, namely $score(query, relevant\ document) - score(query, irrelevant\ document)$, and we train the student model to reproduce this score margin using MSE to penalize the errors it makes.
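A minimal sketch of this margin-MSE loss over a batch of triplets follows. Teacher margins are assumed precomputed; in practice the student scores would come from the sparse dot product between query and document weight vectors:

```python
def margin_mse_loss(teacher_margins, student_pos_scores, student_neg_scores):
    """Mean squared error between teacher and student score margins over a
    batch of (query, relevant doc, irrelevant doc) triplets."""
    n = len(teacher_margins)
    return sum(
        ((sp - sn) - tm) ** 2
        for tm, sp, sn in zip(teacher_margins, student_pos_scores, student_neg_scores)
    ) / n
```

Because only the margin is matched, the student is free to shift all its scores by a constant, which is harmless for ranking.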
Let's think a little about what this process does since it motivates the training detail we wish to discuss. If we recall that the interaction between a query and document using the SPLADE architecture is computed using the dot product between two sparse vectors of non-negative weights for each token, then we can think about this operation as wanting to increase the similarity between the query and the higher scored document weight vectors. It is not 100% accurate, but not misleading, to think of this as something like “rotating” the query in the plane spanned by the two documents' weight vectors toward the more relevant one. Over many batches, this process gradually adjusts the weight vectors' starting positions so that the distances between queries and documents capture the relevance scores provided by the teacher model.
This leads to an observation regarding the feasibility of reproducing the teacher scores. In normal distillation, one knows that given enough capacity the student would be able to reduce the training loss to zero. This is not the case for cross-encoder distillation because the student scores are constrained by the properties of a metric space induced by the dot product on their weight vectors. The cross-encoder has no such constraint. It is quite possible that for particular training queries $q_1$ and $q_2$ and documents $d_1$ and $d_2$ we have to simultaneously arrange for $q_1$ to be close to $d_1$ and $d_2$, and $q_2$ to be close to $d_1$ but far from $d_2$. This is not necessarily possible, and since we penalize the MSE in the scores, one effect is an arbitrary reweighting of the training triplets associated with these queries and documents by the minimum margin we can achieve.
One of the observations we had while working on training ELSER was that the teacher was far from infallible. We initially observed this by manually investigating query-relevant document pairs that were assigned unusually low scores. In the process, we found objectively misscored query-document pairs. Aside from manual intervention in the scoring process, we also decided to explore introducing a better teacher.
Following the literature, we were using MiniLM L-6 from the SBERT family as our initial teacher. While it shows strong performance in multiple settings, there are better teachers in terms of ranking quality. One example is a ranker based on a large generative model: monot5 3b. In the figure below, we compare the score distributions of these two models over query-document pairs. The monot5 3b distribution is clearly much less uniform, and we found that when we tried to train our student model using its raw scores, performance saturated significantly below what we achieved using MiniLM L-6 as our teacher. As before, we postulated that this was down to many important score differences in the peak around zero getting lost, with training instead worrying about unfixable problems related to the long lower tail.
<cite> Monot5 3b and MiniLM L-6 score distributions on a matched scale for a random sample of query-document pairs from the NQ data set. Note: the X-axis does not show the actual scores returned by either of the models. </cite>It is clear that all rankers are of equivalent quality up to monotonic transforms of their scores. Specifically, it doesn't matter if we use $score(query, document)$ or $f(score(query, document))$ provided $f(\cdot)$ is a monotonic increasing function; any ranking quality measure will be the same. However, not all such functions are equivalently effective teachers. We used this fact to smooth out the distribution of monot5 3b scores, and suddenly our student model trained and started to beat the previous best model. In the end, we used a weighted ensemble of our two teachers.
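As one illustrative example of such a monotonic smoothing (not necessarily the transform we actually used), a rank-based transform flattens a peaked score distribution while leaving the induced ranking, and hence all ranking quality measures, unchanged:

```python
def rank_transform(scores):
    """Map each score to its normalized rank in [0, 1]: a monotonic
    increasing transform, so any ranking metric is preserved, but the
    output values are spread uniformly."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    n = len(scores)
    transformed = [0.0] * n
    for rank, idx in enumerate(order):
        transformed[idx] = rank / (n - 1) if n > 1 else 0.0
    return transformed
```

After such a transform, score differences in the crowded peak around zero are magnified, so they are no longer drowned out by the long lower tail during distillation.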
Before closing out this section, we want to briefly mention the FLOPS regularizer, a key ingredient of the improved SPLADE v2 training process. It was proposed in this paper as a means of penalizing a metric directly related to the compute cost of retrieval from an inverted index. In particular, it encourages tokens that provide little information for ranking to be dropped from the query and document representations. We had three observations:
So why could this be, since it is primarily aimed at optimizing retrieval cost? The FLOPS regularizer is defined as follows: it first averages the weights for each token across all the queries (and, separately, all the documents) in the batch, then sums the squares of these average weights. If we consider that the batch typically contains a diverse set of queries and documents, this acts like a penalty that encourages something analogous to stop word removal. Tokens that appear for many distinct queries and documents will dominate the loss, since the contribution from rarely activated tokens is divided by the square of the batch size. We postulate that this is actually helping the model to find better representations for retrieval. From this perspective, the fact that the regularizer term only gets to observe the token weights of queries and documents in the batch is undesirable. This is an area we'd like to revisit.
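As a sketch, the penalty described above can be written in a few lines of NumPy. The tensor shapes and names here are ours, not SPLADE's actual implementation:

```python
import numpy as np

def flops_regularizer(weights: np.ndarray) -> float:
    """FLOPS penalty as described above: average each token's (non-negative)
    weight over the batch, then sum the squares of those averages.

    weights: shape (batch_size, vocab_size), token activations for the
    queries (or, separately, the documents) in a batch.
    """
    mean_per_token = weights.mean(axis=0)        # average over the batch
    return float(np.sum(mean_per_token ** 2))    # sum of squared averages

batch, vocab = 32, 1000

# A "stop-word-like" token activated for every example dominates the penalty,
# while a token activated once contributes its weight divided by the square
# of the batch size, exactly the behaviour described above.
common = np.zeros((batch, vocab)); common[:, 0] = 1.0   # fires everywhere
rare = np.zeros((batch, vocab)); rare[0, 1] = 1.0       # fires once
print(flops_regularizer(common))   # 1.0
print(flops_regularizer(rare))     # (1/32)**2, about 0.001
```

Note the quadratic dependence on how often a token fires: a token active in every row of the batch is penalized roughly a thousand times harder than one active in a single row.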
We have given a brief overview of the model, the Elastic Learned Sparse Encoder (ELSER), its rationale, and some aspects of the training process behind the feature we're releasing in a technical preview for the new text_expansion query and integrating with the new Elasticsearch Relevance Engine. To date, we have focused on retrieval quality in a zero-shot setting and demonstrated good results against a variety of strong baselines. As we move toward GA, we plan to do more work on operationalizing this model and in particular around improving inference and retrieval performance.
Stay tuned for the next blog post in this series, where we'll look at combining various retrieval methods using hybrid retrieval as we continue to explore exciting new retrieval methods using Elasticsearch.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.
Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.
The random_sampler aggregation adds the capability to randomly sample documents in a statistically robust manner. Randomly sampling documents in aggregations allows you to balance speed and accuracy at query time. You can aggregate billions of documents with high accuracy at a fraction of the latency, achieving faster results with fewer resources and comparable accuracy, all with a simple aggregation.
Let’s run through some basic details, best practices, and how it works, so you can try it out in the Elasticsearch Service today.
Random sampling in Elasticsearch has never been easier or faster. If your query has many aggregations, you can quickly obtain results by using the random_sampler aggregation.
POST _search?size=0&track_total_hits=false
{
"aggs": {
"sampled": {
"random_sampler": {
"probability": 0.001,
"seed": 42
},
"aggs": {
...
}
}
}
}
All aggregations nested under random_sampler in the example above will return sampled results. Each one sees only about 0.1% of the documents (roughly 1 in every 1,000). Where computational cost correlates with the number of documents, aggregation speed increases accordingly. You may also have noticed the seed parameter. You can provide a seed to get consistent results on the same shards. Without a seed, a new random subset of documents is chosen each time and you may get slightly different aggregated results.
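As a rough back-of-envelope (our heuristic, not an Elasticsearch formula): sampling at probability p over n documents aggregates about p * n of them, and for count-like and mean-like aggregations the relative error typically scales as 1 / sqrt(sampled documents).

```python
import math

def expected_sample(n_docs: int, probability: float):
    """Back-of-envelope for the speed/accuracy trade-off: returns the
    expected number of sampled documents and an order-of-magnitude
    relative-error estimate, assuming error ~ 1/sqrt(sampled docs)."""
    sampled = n_docs * probability
    rel_error = 1.0 / math.sqrt(sampled) if sampled > 0 else float("inf")
    return sampled, rel_error

# With 64 million documents at probability 0.001, roughly 64,000 documents
# are aggregated, for an expected relative error on the order of 0.4%.
sampled, err = expected_sample(64_000_000, 0.001)
```

This heuristic is optimistic for complex aggregations; the measured figures reported below (around 4% relative error in half of the tested scenarios) reflect harder cases such as bucketed aggregations over skewed data.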
How much faster is the random_sampler? Speed improves in line with the provided probability, since fewer documents are aggregated, but the improvement eventually flattens out: each aggregation has its own computational overhead regardless of the number of documents. Comparing multi-bucket with single-metric aggregations illustrates this overhead. Multi-bucket aggregations have higher overhead due to their bucket-handling logic, so while sampling still speeds them up, the rate of improvement flattens out sooner than for single-metric aggregations.
![enter image description here](/assets/images/aggregate-data-faster-with-new-the-random-sampler-aggregation/elastic-random-sampler.png)
Figure 1. The speedup expected for aggregations of different constant overhead.
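The flattening in Figure 1 can be reproduced with a toy cost model (the numbers are illustrative, not Elasticsearch's actual cost accounting): if an aggregation costs overhead + cost_per_doc * docs_seen, then the fixed overhead bounds the achievable speedup.

```python
def modeled_speedup(downsample: float, overhead: float, n_docs: int = 1_000_000) -> float:
    """Toy model: aggregation time = overhead + cost_per_doc * docs_seen,
    with cost_per_doc fixed at 1 unit. Sampling divides docs seen by the
    downsample factor; the fixed overhead caps the achievable speedup."""
    full = overhead + n_docs
    sampled = overhead + n_docs / downsample
    return full / sampled

# A cheap single-metric aggregation keeps gaining with more downsampling,
# while a bucket-heavy aggregation with high overhead saturates early.
cheap = [modeled_speedup(f, overhead=1_000) for f in (10, 100, 1_000)]
heavy = [modeled_speedup(f, overhead=100_000) for f in (10, 100, 1_000)]
# heavy can never exceed (n_docs + overhead) / overhead = 11x in this model.
```

Plotting these two curves against the downsample factor produces the same qualitative shape as Figure 1.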
Here are some results on expected speed and error rate over an APM data set of 64 million documents.
The calculations are from 300 query and aggregation combinations, 5 seeds, and 9 sampling probabilities. In total, 13,500 separate experiments generated the following graphs of median speedup and median relative error as a function of the downsample factor, which is 1 / sample probability.
![enter image description here](/assets/images/aggregate-data-faster-with-new-the-random-sampler-aggregation/elastic-random-sampler-2.png)
Figure 2. Median speedup as a function of the downsample factor (or 1 / probability provided for the sampler).
![enter image description here](/assets/images/aggregate-data-faster-with-new-the-random-sampler-aggregation/elastic-random-sampler-3.png)
Figure 3. Median relative error as a function of the downsample factor (or 1 / probability provided for the sampler).
With a probability of 0.001, for half of the scenarios tested, there was an 80x speed improvement or better with a 4% relative error or less. These tests involved a little over 64 million documents but spread across many shards. More compact shards and larger data can expect better results.
But, you may ask, do the visualizations look the same?
Below are two visualizations showing document counts in 5-minute buckets over 100+ million documents. The full set loads in seconds, while the sampled version loads in milliseconds, with almost no discernible visual difference.
![enter image description here](/assets/images/aggregate-data-faster-with-new-the-random-sampler-aggregation/elastic-random-sampler-4.png)
Figure 4. Sampled vs unsampled document count visualizations.
Here is another example. This time the average transaction by hour is calculated and visualized. While visually these are not exactly the same, the overall trends are still evident. For a quick overview of the data to catch trends, sampling works marvelously.
![enter image description here](/assets/images/aggregate-data-faster-with-new-the-random-sampler-aggregation/elastic-rrandom-sampler-5.png)
Figure 5. Sampled vs. unsampled average transaction time by hour visualization.
Sampling shines when you have a large data set. In these cases you might ask: should I sample before the data is indexed in Elasticsearch? Sampling at query time and sampling before ingestion are complementary; each has its distinct advantages.
Sampling at ingest time can save disk and indexing costs. However, if your data has multiple facets, you have to stratify the sampling over those facets before ingestion, unless you know exactly how the data will be queried. This suffers from the curse of dimensionality, and you could end up with underrepresented facets. Furthermore, when sampling before ingestion you have to cater for the worst case. For example, if you want to compute percentiles for two queries, one which matches 50% of the documents in an index and one which matches 1%, you can get away with 7X more downsampling for the first query and achieve the same accuracy.
Here is a summary of what to expect from sampling with the random_sampler at query time.
![enter image description here](/assets/images/aggregate-data-faster-with-new-the-random-sampler-aggregation/elastic-random-sampler-6.png)
Figure 6. Relative error for different aggregations.
Sampling accuracy varies across aggregations (see Figure 6 for some examples). Here is a list of some aggregations in descending order of accuracy: percentiles, counts, means, sums, variance, minimum, and maximum. Metric aggregation accuracy is also affected by variation in the underlying data: the lower the variation in the values, the fewer samples you need to get accurate aggregate values. The minimum and maximum will not be reliable in the presence of outliers, since there is always a reasonable chance that the sampled set misses the one very large (or small) value in the data set. If you are using terms aggregations (or some partitioning such as a date histogram), aggregate values for terms (or buckets) with few documents will be less accurate or missed altogether.
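A quick way to reason about the outlier and rare-bucket cases (assuming each document is kept independently with probability p, which matches the sampler's uniform behaviour) is to compute the chance that a bucket, or the document carrying an extreme value, is missed entirely:

```python
def miss_probability(docs_in_bucket: int, probability: float) -> float:
    """Chance that sampling at the given probability selects *none* of the
    documents in a bucket (or none of the documents carrying an extreme
    value), assuming independent selection with that probability."""
    return (1.0 - probability) ** docs_in_bucket

# A single outlier document is almost certainly missed at p = 0.001...
print(miss_probability(1, 0.001))        # 0.999
# ...while a term with 10,000 documents is essentially never missed.
print(miss_probability(10_000, 0.001))   # about 4.5e-5
```

This is why minimum and maximum degrade badly under sampling while aggregates over well-populated buckets remain reliable.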
Aggregations also have fixed overheads (see Figure 1 for an example). This means as the sample size decreases, the performance improvement will eventually level out. Aggregations which have many buckets have higher overheads and so the speedup you will gain from sampling is smaller. For example, a terms aggregation for a high cardinality field will show less performance benefit.
If in doubt, some simple experiments will often suffice to determine good settings for your data set. For example, suppose you want to speed up a dashboard; try reducing the sample probability for as long as the visualizations still look similar enough. Chances are your data characteristics will be stable, so this setting will remain reliable.
Sampling considers the entire document set within a shard. Once it creates the sampled document set, sampling applies any provided user filter. The documents that match the filter and are within the sampled set are then aggregated (see Figure 7).
![enter image description here](/assets/images/aggregate-data-faster-with-new-the-random-sampler-aggregation/elastic-random-sampler-7.png)
Figure 7. Typical request and data flow for the random_sampler aggregation.
The key to the sampling is generating this random subset of the shard efficiently and without statistical bias. Taking geometrically distributed random steps through the document set is equivalent to uniform random sampling: each document in the set has an equal chance of being selected into the sample. The advantage of this approach is that the sampling cost scales with p, the probability configured in the aggregation. No matter how small p is, the relative latency that the sampling adds remains fixed.
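A minimal Python sketch of the idea (not the actual implementation, which uses a fast quantized log as described below) draws the geometric gaps via inverse-CDF sampling:

```python
import math
import random

def geometric_sample(n_docs: int, p: float, seed: int = 42) -> list:
    """Select document ordinals 0..n_docs-1 by taking geometrically
    distributed steps. This is equivalent to keeping each document
    independently with probability p, but the cost scales with p * n_docs
    rather than n_docs."""
    rng = random.Random(seed)

    def gap() -> int:
        # Inverse-CDF sample of the number of documents skipped before the
        # next selection: floor(log(1 - u) / log(1 - p)), u uniform in [0, 1).
        return int(math.log(1.0 - rng.random()) / math.log(1.0 - p))

    selected = []
    doc = gap()
    while doc < n_docs:
        selected.append(doc)
        doc += 1 + gap()  # step past the current doc, then skip the gap
    return selected

# Over a million documents at p = 0.001, the sample size concentrates around
# p * n_docs = 1000, and ordinals come back sorted and distinct.
sample = geometric_sample(1_000_000, 0.001)
```

The loop touches only the selected documents, which is what makes the cost proportional to p * n_docs.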
To achieve the highest performance, accuracy, and robustness, we evaluated a range of realistic scenarios.
In the case of random_sampler, the evaluation process is complicated by two factors:
We began with a proof of concept that showed that the overall strategy worked and the performance characteristics were remarkable. However, there are multiple factors which can affect implementation performance and accuracy. For example, we found the off-the-shelf sampling code for the geometric distribution was not fast enough. We decided to roll our own using some tricks to extract more random samples per random bit along with a very fast quantized version of the log function. You also need to be careful that you are generating statistically independent samples for different shards. In summary, as is often the case, the devil is in the details.
Undaunted, we wrote a test harness using the Elastic Python client to programmatically generate aggregations and queries, and perform statistical tests of quality.
We wanted the approximations we produce to be unbiased. This means that if you ran a sampled aggregation repeatedly and averaged the results, they would converge towards the true value. Standard machinery allows you to test whether there is statistically significant evidence of bias. We used a t-test for the difference between the statistic and the true value for each aggregation. In over 300 different experiments, the minimum p-value was around 0.0003, which, given we ran 300 experiments, has about 9% odds of occurring by chance. This is a little low, but not enough to worry about; furthermore, the median p-value was 0.38.
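The bias check can be sketched in pure Python. This uses a normal approximation to the t distribution, and the numbers below are made up for illustration, not our experimental data:

```python
import math
import statistics

def bias_p_value(estimates, true_value):
    """Two-sided test for bias: is the mean of repeated sampled estimates
    significantly different from the true, unsampled value? Uses a
    one-sample t statistic with a standard-normal tail, which is adequate
    unless the number of repeats is very small."""
    n = len(estimates)
    mean = statistics.fmean(estimates)
    sd = statistics.stdev(estimates)
    t = (mean - true_value) / (sd / math.sqrt(n))
    return math.erfc(abs(t) / math.sqrt(2.0))  # two-sided p-value

# Estimates scattered symmetrically around the true value: no evidence of bias.
p_unbiased = bias_p_value([99.1, 100.8, 100.2, 99.6, 100.4, 99.9], 100.0)
# Estimates consistently ~5% high: overwhelming evidence of bias.
p_biased = bias_p_value([104.9, 105.2, 104.7, 105.4, 105.0, 104.8], 100.0)
```

Run across many independent aggregation experiments, the resulting p-values should look uniform on [0, 1] if the sampler is unbiased, which is the property the minimum and median p-values above are probing.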
We also tested whether various index properties affect the statistical properties of the results. For example, we wanted to see if we could measure a statistically significant difference between the distributions of results with and without index sorting. A Kolmogorov-Smirnov test can be used to check whether two samples come from the same distribution. In our 300 experiments the smallest p-value was around 0.002, which occurs by chance with odds of about 45%.
We’re not done with this feature yet. Once you have the ability to generate fast approximate results, a key question is: how accurate are those results? We’re planning to integrate a confidence interval calculation directly into the aggregation framework to answer this efficiently in a future release. Learn more about the random_sampler aggregation in the documentation. You can explore this feature and more with a free 14-day trial of Elastic Cloud.
The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.