This post shares strategies for implementing academic papers in a software application. It draws on examples from Elasticsearch and Lucene in hopes of helping other engineers learn from our experiences. You might read these strategies and think “but this is just software development!” And that would indeed be true: as engineers we already have the right practices and tools, they just need to be adapted to a new challenge.
Background
While developing Elasticsearch, we occasionally come across an important problem with no simple or established approach to solving it. It's natural to ask “hmm, is there an academic paper that addresses this?” Other times, academic work is a source of inspiration. We'll encounter a paper proposing a new algorithm or data structure and think “this would be so useful!” Here are just a few examples of how Elasticsearch and Apache Lucene incorporate academic work:
- HyperLogLog++ for cardinality aggregations
- C3 algorithm for adaptive replica selection
- Hierarchical Navigable Small World Graphs (HNSW) for nearest vector search in Lucene
- MIC statistic to improve machine learning classification
- Block-max WAND for faster top-hits retrieval in Lucene
- ... and many more
Academic papers are an invaluable resource for engineers developing data-intensive systems. But implementing them can be intimidating and error-prone — algorithm descriptions are often complex, with important practical details omitted. And testing is a real challenge: for example, how can we thoroughly test a machine learning algorithm whose output depends closely on the dataset?
Evaluate the paper as you would a software dependency
Adding a new software dependency requires careful evaluation: if the package is incorrect, slow, or insecure, our project could be too. So before pulling in a dependency, developers take the time to assess its quality.
The same applies to academic papers you're considering implementing. It may seem that because an algorithm was published in a paper, it must be correct and perform well. But even though it passed a review process, an academic paper can have issues. Maybe the correctness proof relies on assumptions that aren't realistic. Or perhaps the “experiments” section shows much better performance than the baseline, but this only holds on a specific dataset. Even if the paper is of great quality, its approach may not be a good fit for your project.
When thinking about whether to take a “dependency” on an academic paper, it's helpful to ask the same questions we would of a software package:
- Is the library widely-used and “battle tested”? → Have other packages implemented this paper, and has it worked well for them?
- Are performance benchmarks available? Do these seem accurate and fair? → Does the paper include realistic experiments? Are they well designed?
- Is a performance improvement big enough to justify the complexity? → Does the paper compare to a strong baseline approach? How much does it outperform this baseline?
- Will the approach integrate well with our system? → Do the algorithm's assumptions and trade-offs fit our use case?
Somehow when a software package publishes a performance comparison against its competitors, the package always comes out fastest! If a third party designed the benchmarks, they may be more balanced. The same phenomenon applies to academic papers. If an algorithm performs well not only in the original paper, but also appears in other papers as a strong baseline, then it is very likely to be solid.
Get creative with testing
Algorithms from academic papers often have more sophisticated behavior than the types of algorithms we routinely encounter. Perhaps it's an approximation algorithm that trades off accuracy for better speed. Or maybe it's a machine learning method that takes in a large dataset, and produces (sometimes unexpected) outputs. How can we write tests for these algorithms if we can't characterize their behavior in a simple way?
Focus on invariants
When designing unit tests, it's common to think in terms of examples: if we give the algorithm this example input, it should produce that output. Unfortunately, for most mathematical algorithms, example-based testing doesn't sufficiently cover their behavior.
Let's consider the C3 algorithm, which Elasticsearch uses to decide which node should handle a search request. It ranks each node using a nuanced formula that incorporates the node's previous service and response times, and its queue size. Testing a couple of examples doesn't really verify that we understood the formula correctly. It helps to step back and think about testing invariants: if service time increases, does the node's rank decrease? If the queue size is 0, is the rank determined by response time, as the paper claims?
Focusing on invariants can help in a number of common cases:
- Is the method supposed to be order-agnostic? If so, passing the input data in a different order should result in the same output.
- Does some step in the algorithm produce class probabilities? If so, these probabilities should sum to 1.
- Is the function symmetric around the origin? If so, flipping the sign of the input should simply flip the sign of the output.
When we first implemented C3, we had a bug in the formula where we accidentally used the inverse of response time in place of response time. This meant slower nodes could be ranked higher! When fixing the issue, we made sure to add invariant checks to guard against future mistakes.
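To make this concrete, here is a minimal sketch of a randomized invariant test in the spirit of those checks. The rank() formula and its parameters are placeholders invented for illustration, not Elasticsearch's actual C3 code; lower values mean a more attractive node.

```java
import java.util.Random;

// Minimal sketch of randomized invariant testing for a replica-ranking formula.
// rank() is a placeholder formula invented for illustration (NOT Elasticsearch's
// C3 implementation); lower values mean a more attractive node.
public class RankInvariantTest {

    static double rank(double responseTime, double serviceTime, int queueSize) {
        // Placeholder formula: penalize slow service more heavily as the queue grows.
        return responseTime + serviceTime * Math.pow(queueSize, 3);
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 10_000; i++) {
            double responseTime = 1 + random.nextDouble() * 100;
            double serviceTime = 1 + random.nextDouble() * 100;
            int queueSize = random.nextInt(50);

            double baseline = rank(responseTime, serviceTime, queueSize);

            // Invariant 1: a slower response time should never make a node look better.
            check(rank(responseTime * 2, serviceTime, queueSize) >= baseline,
                "slower response time ranked better");

            // Invariant 2: a slower service time should never make a node look better.
            check(rank(responseTime, serviceTime * 2, queueSize) >= baseline,
                "slower service time ranked better");

            // Invariant 3: with an empty queue, only response time should matter.
            check(rank(responseTime, serviceTime, 0) == rank(responseTime, serviceTime * 3, 0),
                "empty-queue rank depends on service time");
        }
        System.out.println("All invariants held.");
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }
}
```

Because the inputs are random, a slip like the inverted response time described above is very likely to trip the first check.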
Compare to a reference implementation
Alongside the paper, the authors hopefully published an implementation of the algorithm. (This is especially likely if the paper contains experiments, as many journals require authors to post code for reproducing the results.) You can test your approach against this reference implementation to make sure you haven't missed important details of the algorithm.
While developing Lucene's HNSW implementation for nearest-neighbor search, we tested against a reference library by the paper's authors. We ran both Lucene and the library against the same dataset, comparing the accuracy of their results and the number of computations they performed. When these numbers matched closely, we were confident that Lucene faithfully implemented the algorithm.
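Here is a rough sketch of what such a comparison can look like, assuming both implementations have already been run on the same queries. The hard-coded result lists are stand-ins for the neighbors returned by the reference library and by the implementation under test.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch: measure how closely our results track a reference implementation
// by computing the overlap (recall) between the two top-k neighbor lists.
public class ReferenceComparison {

    static double recall(List<Integer> referenceTopK, List<Integer> candidateTopK) {
        Set<Integer> reference = new HashSet<>(referenceTopK);
        long matches = candidateTopK.stream().filter(reference::contains).count();
        return (double) matches / referenceTopK.size();
    }

    public static void main(String[] args) {
        // Placeholder results; in practice these come from running the same query
        // through both the reference library and our implementation.
        List<Integer> referenceTopK = List.of(3, 7, 12, 19, 28, 31, 40, 44, 52, 60);
        List<Integer> candidateTopK = List.of(3, 7, 12, 19, 28, 31, 40, 44, 52, 61);

        double recallAt10 = recall(referenceTopK, candidateTopK);
        System.out.printf("recall@10 = %.2f%n", recallAt10);

        // Alongside recall, we'd also compare cheap-to-collect counters such as the
        // number of distance computations each implementation performed per query.
        if (recallAt10 < 0.9) {
            throw new AssertionError("recall diverges too far from the reference implementation");
        }
    }
}
```

In practice we would average recall over many queries and datasets rather than asserting on a single result list.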
When incorporating an algorithm into a system, you often need to make modifications or extensions, like scaling it to multiple cores, or adding heuristics to improve performance. It's best to first implement a "vanilla" version, test it against the reference, then make incremental changes. That way you can be confident you've captured all the key parts before making customizations.
Duel against an existing algorithm
The last section suggests another idea for a test invariant: comparing the algorithm's output to that of a simpler, better-understood algorithm. As an example, consider the block-max WAND algorithm in Lucene, which speeds up document retrieval by skipping over documents that can't appear in the top results. It is difficult to describe exactly how block-max WAND should behave in every case, but we do know that applying it shouldn't change the top results! So our tests can generate several random search queries, run each of them both with and without the WAND optimization, and check that the results always match.
An important aspect of these tests is that they generate random inputs on which to run the comparison. This can help exercise cases you wouldn't have thought of, and surface unexpected issues. As an example, Lucene's randomized comparison test for BM25F scoring has helped catch bugs in subtle edge cases. The idea of feeding an algorithm random inputs is closely related to the concept of fuzzing, a common testing technique in computer security.
Elasticsearch and Lucene frequently use this testing approach. If you see a test that mentions a "duel" between two algorithms (TestDuelingAnalyzers, testDuelTermsQuery...), then you know this strategy is in action.
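Stripped of Lucene's test infrastructure, a duel boils down to something like the sketch below. The SearchFunction interface and the stub engine are hypothetical placeholders; the real tests plug in actual searchers with the optimization switched on and off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of a "duel" test: run random queries through a simple,
// well-understood baseline and through the optimized algorithm under test,
// then check that the top hits agree. The search functions here are
// hypothetical placeholders, not Lucene APIs.
public class DuelTest {

    interface SearchFunction {
        List<Integer> topDocs(String query, int k);
    }

    static void duel(SearchFunction baseline, SearchFunction optimized, Random random) {
        for (int i = 0; i < 1_000; i++) {
            String query = randomQuery(random);
            List<Integer> expected = baseline.topDocs(query, 10);
            List<Integer> actual = optimized.topDocs(query, 10);
            // A production duel would also need to handle tied scores,
            // which can legitimately reorder hits.
            if (!expected.equals(actual)) {
                throw new AssertionError("Top hits differ for query: " + query);
            }
        }
    }

    static String randomQuery(Random random) {
        // Random terms drawn from a tiny vocabulary; a real test would use
        // the project's existing randomized testing infrastructure.
        String[] vocabulary = {"fox", "jump", "lazy", "dog", "quick", "brown"};
        List<String> terms = new ArrayList<>();
        int numTerms = 1 + random.nextInt(3);
        for (int t = 0; t < numTerms; t++) {
            terms.add(vocabulary[random.nextInt(vocabulary.length)]);
        }
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        // Trivial demo: both "engines" return the same canned results, so the duel passes.
        SearchFunction stub = (query, k) -> List.of(1, 2, 3);
        duel(stub, stub, new Random(42));
        System.out.println("Duel passed: top hits matched on all random queries.");
    }
}
```

Seeding the random generator keeps failures reproducible, which matters when a duel only fails on rare query shapes.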
Use the paper's terminology
When another developer works with your code, they'll need to consult the paper to follow its details. The comment on Elasticsearch's HyperLogLog++ implementation says it well: “Trying to understand what this class does without having read the paper is considered adventurous.” The same comment also sets a good example: it includes a link to the academic paper and highlights what modifications were made to the algorithm as it was originally described.
Since developers will base their understanding of the code on the paper, it's helpful to use the exact same terminology. Because mathematical notation is terse, this can result in names that would not usually be considered “good style”, but are very clear in the context of the paper. Formulas from academic papers are one of the few times you'll encounter cryptic variable names in Elasticsearch like rS and muBarSInverse.
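For instance, a field declaration might mirror the paper's symbols and point the reader back to the source. This is a hypothetical sketch rather than the actual Elasticsearch code, and the symbol descriptions in the comments are our reading of the C3 paper:

```java
/**
 * Per-node statistics used for adaptive replica selection, following the notation in
 * "C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection"
 * (Suresh et al., NSDI 2015). Field names deliberately mirror the paper's symbols.
 */
class NodeStatistics {
    double rS;             // R_s in the paper: smoothed response time observed for node s
    double muBarSInverse;  // 1 / mu-bar_s: smoothed service time reported by node s
}
```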
The author's recommended way of reading a paper: with a large coffee.
You can email the author
When working through a tough paper, you may spend hours puzzling over a formula, unsure if you're misunderstanding or if there's just a typo. If this were an open source project, you could ask a question on GitHub or StackOverflow. But where can you turn for an academic paper? The authors seem busy and might be annoyed by your emails.
On the contrary, many academics love hearing that their ideas are being put into practice and are happy to answer questions over email. If you work on a product they're familiar with, they might even list the application on their website!
There's also a growing trend for academics to discuss papers in the open, using many of the same tools from software development. If a paper has an accompanying software package, you might find answers to common questions on GitHub. Stack Exchange communities like “Theoretical Computer Science” and “Cross Validated” also contain detailed discussions about popular papers. Some conferences have begun to publish all paper reviews online. These reviews contain back-and-forth discussions with the authors that can surface helpful insights about the approach.
To be continued
This post focuses on the basics of choosing an academic paper and implementing it correctly, but doesn't cover all aspects of actually deploying the algorithm. For example, if the algorithm is just one component in a complex system, how do we ensure that changes to the component lead to end-to-end improvements? And what if integrating the algorithm requires substantial modifications or extensions that the original paper doesn't cover? These are important topics we hope to share more about in future posts.