Understanding the approximate nearest neighbor (ANN) algorithm

If you grew up in a time before the internet made its debut, you’ll remember it wasn’t always easy to find new things to like. We discovered new bands when we happened to hear them on the radio, we’d see a new TV show by accident because we forgot to change the channel, and we’d find a new favorite video game based almost entirely on the picture on the cover. 

Nowadays, things are very different. Spotify points us to artists that match our tastes, Netflix highlights movies and TV shows it knows we’ll enjoy, and Xbox knows what we’ll probably want to play next. These recommendation systems make it so much easier for us to find the things we’re actually looking for, and they’re powered by nearest neighbor (NN) algorithms. NN looks at the extensive sea of information it has available and identifies the closest thing to something you like, or something you’re searching for.

But NN algorithms have an inherent flaw. If the amount of data they’re analyzing gets too big, crawling through every option takes forever. This is a problem, especially as these data sources get bigger and bigger every year. This is where approximate nearest neighbor (ANN) grabs the baton from NN and changes the game.

In this article, we’ll cover the following key topics about ANN:

  • ANN definition

  • How ANN works

  • When to use ANN search

  • ANN importance in vector search

  • Various types of ANN algorithms

Approximate nearest neighbor explained

Approximate nearest neighbor (ANN) is an algorithm that finds a data point in a data set that's very close to the given query point, but not necessarily the absolute closest one. An NN algorithm searches exhaustively through all the data to find the perfect match, whereas an ANN algorithm will settle for a match that’s close enough.

This might sound like a worse solution, but it’s actually the key to fast similarity search. ANN uses intelligent shortcuts and data structures to efficiently navigate the search space. So instead of taking up huge amounts of time and resources, it can identify, with much less effort, data points that are close enough to be useful in most practical scenarios.

Essentially, it’s a trade-off. If you absolutely need to find the one best match, you can do that at the expense of speed and performance with NN. But if you can tolerate a tiny drop in accuracy, ANN is almost always a better solution.

How approximate nearest neighbor algorithms work

The first part of how many ANN algorithms work is dimensionality reduction, where the goal is to turn a higher-dimensional data set into a lower-dimensional one. This makes the search task simpler and more efficient than having to analyze every dimension of the raw data.

These algorithms rest on the mathematical concept of metric spaces: the spaces where data points reside and distances between them are defined. These distances must adhere to specific rules (non-negativity, identity, symmetry, triangle inequality). In practice, they’re calculated with functions like Euclidean distance, or with similarity measures like cosine similarity, which isn’t a strict metric but is widely used for comparing vectors.

To better understand this, imagine you’re on holiday searching for the villa you’ve rented. Instead of checking every single building one-by-one (higher-dimensional), you’d use a map, which reduces the problem into two dimensions (lower-dimensional). (This is a deliberately simplistic example. Dimensionality reduction is not the sole method employed by ANN algorithms to improve efficiency.)
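
To make those distance calculations concrete, here’s a minimal sketch in Python. (NumPy and the example vectors are our own illustration; the article doesn’t prescribe any particular library.)

    import numpy as np

    # Two points (vectors) in a three-dimensional space
    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 1.0, 4.0])

    # Euclidean distance: the straight-line distance between the two points
    euclidean = np.linalg.norm(a - b)

    # Cosine similarity: compares direction rather than magnitude (1.0 = same direction)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(euclidean, cosine)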

ANN algorithms also leverage clever data structures called indexes to improve efficiency. By pre-processing the data into these indexes, ANN can navigate the search space much more quickly. Think of them as street signs that help you orient yourself on the map and reach your holiday villa faster.

Types of approximate nearest neighbor algorithms

While the concept of ANN offers a compelling speed advantage in search, this term actually covers a diverse toolbox of algorithms. They all have their own strengths and trade-offs, and understanding these nuances is critical when choosing the right tool for your specific data and search needs.

KD-trees

KD-trees organize data points in a hierarchical tree structure, partitioning the space based on specific dimensions. This enables fast and efficient search in low-dimensional spaces and Euclidean distance-based queries.

But while KD-trees excel at finding nearest neighbors in low dimensions, they suffer from the “curse of dimensionality.” As the number of dimensions increases, the volume of the space grows so fast that the data points become sparse, and KD-trees’ strategy of splitting on one axis at a time becomes ineffective. The search ends up examining most of the data, losing the efficiency advantage and approaching the slowness of a simple linear scan through all points.
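
As a quick illustration of the index-then-query pattern, here’s roughly what a KD-tree search looks like in Python using SciPy’s KDTree. (The library choice, the random data, and the parameter values are assumptions made for the sake of the example.)

    import numpy as np
    from scipy.spatial import KDTree

    # 10,000 random points in a low-dimensional (3D) space
    points = np.random.rand(10_000, 3)

    # Pre-process the data into a tree index (a one-time cost)
    tree = KDTree(points)

    # Query: find the 5 nearest neighbors of a new point
    query = np.array([0.5, 0.5, 0.5])
    distances, indices = tree.query(query, k=5)

    print(indices)    # positions of the 5 closest points
    print(distances)  # their Euclidean distances from the query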

Locality-sensitive hashing (LSH)

LSH is a powerful ANN technique that "hashes" data points into buckets so that similar points are likely to land in the same bucket. This grouping makes similar items easy to find again, and it allows LSH to excel at searching massive, high-dimensional data sets like images or text with both speed and scalability, all while still returning "close enough" matches with good accuracy. Keep in mind that LSH can occasionally produce false positives (treating non-similar points as similar), and its effectiveness depends on the distance metric and data type. There are various LSH families designed for different metrics (e.g., Euclidean distance, Jaccard similarity), which keeps LSH versatile.
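
To show the idea, here’s a minimal from-scratch sketch of one LSH family, random hyperplane hashing, which approximates cosine similarity. (The data, the number of hyperplanes, and the single hash table are simplifications of our own; production systems typically use many hash tables and carefully tuned parameters.)

    import numpy as np

    rng = np.random.default_rng(42)
    dim, n_points, n_planes = 64, 100_000, 8

    data = rng.normal(size=(n_points, dim))

    # Each random hyperplane contributes one bit to a point's hash code
    planes = rng.normal(size=(n_planes, dim))

    def lsh_hash(vectors):
        # Points on the same side of every hyperplane share a hash bucket
        return (vectors @ planes.T > 0).astype(np.uint8)

    # Index: group point ids by their hash bucket
    buckets = {}
    for i, bits in enumerate(lsh_hash(data)):
        buckets.setdefault(bits.tobytes(), []).append(i)

    # Query: only the candidates in the query's bucket are compared exactly
    query = rng.normal(size=dim)
    candidates = buckets.get(lsh_hash(query[None, :])[0].tobytes(), [])
    print(f"{len(candidates)} candidates out of {n_points} points")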

Annoy

Annoy (Approximate Nearest Neighbors Oh Yeah) isn't a single algorithm, but an open-source C++ library (with Python bindings) that uses its own tree-based approach, building a forest of trees from random projections rather than implementing LSH or KD-trees directly. It's designed for memory-efficient and fast search in high-dimensional spaces, making it suitable for real-time queries, and its indexes can be saved to disk and shared across processes. Annoy's strength lies in its simple, user-friendly interface and its tunability: you trade accuracy against index size and query speed by adjusting how many trees you build and how many nodes you inspect at query time. While it simplifies the process, remember that picking those parameters well is crucial for optimal performance, and its effectiveness still depends on factors like your data and accuracy requirements.
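
For reference, a typical Annoy workflow looks something like the following sketch, using the library's Python bindings. (The vector length, the random item vectors, and the parameter values here are made-up examples, not recommendations.)

    import random
    from annoy import AnnoyIndex

    dim = 40                             # length of every item vector
    index = AnnoyIndex(dim, "angular")   # "angular" is roughly cosine distance

    # Add items to the index as (id, vector) pairs
    for i in range(1_000):
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

    # Build a forest of trees: more trees means better accuracy but a bigger index
    index.build(10)
    index.save("items.ann")              # indexes can be saved and memory-mapped later

    # Query: the 5 approximate nearest neighbors of item 0
    print(index.get_nns_by_item(0, 5))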

Linear scan algorithm

Although not typically classified as an ANN technique, linear scan is worth mentioning because it’s the brute-force baseline that ANN algorithms approximate. It iterates through every data point sequentially, calculating the distance from the query to each record and keeping track of the best matches, which guarantees the exact nearest neighbors. Because of its simplicity, it’s easy to implement and great for small data sets. The downside is that it’s inefficient for large data sets, slow with high-dimensional data, and impractical for real-time applications.
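
For comparison, a linear scan is only a few lines of Python. (NumPy and the data shapes are our own example; the point is simply that every record gets touched on every query.)

    import numpy as np

    def linear_scan(data, query, k=5):
        # Compute the distance from the query to every single point (no index, no shortcuts)
        distances = np.linalg.norm(data - query, axis=1)
        # Keep the k closest points: these are the exact nearest neighbors
        nearest = np.argsort(distances)[:k]
        return nearest, distances[nearest]

    data = np.random.rand(50_000, 128)   # manageable here, but the cost grows linearly with the data
    query = np.random.rand(128)
    print(linear_scan(data, query))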

Choosing the right ANN

Before you dive into picking an ANN algorithm, there are a few things you should consider:

  • Data set size and dimensionality: Consider using locality-sensitive hashing for large and high-dimensional data and KD-trees for smaller and lower-dimensional data.

  • Desired accuracy level: If absolute precision is vital, linear scan is likely the best option — otherwise, look at LSH or Annoy for good accuracy with speed.

  • Computational resources: Annoy offers flexibility, but consider memory and processing limitations before choosing an algorithm within it.

Remember — there's no one-size-fits-all solution. Experiment with different ANN algorithms and evaluate their performance on your specific data to find the perfect match for your vector search needs. Beyond these options, the world of ANN algorithms is constantly evolving, so it’s also worth keeping an ear to the ground so you don’t miss something new that could improve your search.

What you should do next

Whenever you're ready, here are four ways we can help you harness insights from your business’ data:

  1. Start a free trial and see how Elastic can help your business.

  2. Tour our solutions to see how the Elasticsearch Platform works and how our solutions will fit your needs.

  3. Discover how to incorporate generative AI in the enterprise.

  4. Share this article with someone you know who'd enjoy reading it, whether via email, LinkedIn, Twitter, or Facebook.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use. 

Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.