
Celebrating 20 years of Apache Lucene: Features

Apache Lucene turned 20 this year, and to celebrate, we've reached out to folks involved in the project to talk about its past, present, and future. In this week's post, we're going to take a look at some of the amazing features that have been implemented over the years. In addition to hearing from Lucene's founder, Doug Cutting, we'll also feature different Project Management Committee (PMC) members, committers, and contributors, each highlighting their favorites. Let's start with Simon:

My favorite Lucene feature is DocValues, which may seem like a “must-have” that is straightforward to implement. To a certain extent that's true, but the interesting part is the history behind it.

After the great rewrite of Lucene 4, where we moved to something called flexible indexing (which allowed us to plug in posting formats at first and eventually entire codecs, covering almost everything we write to disk), we started working on a feature called Column Stride Fields. Because its abbreviation was easily confused with the Compound File System (CSF vs. CFS), it was renamed to DocValues. It was a rather simple typed, column-oriented mechanism to store per-document values for a field, and at the same time it was one of the most requested features to date. We started with an iterative API and quickly moved to a more flexible random access API. It went through many iterations before we had a stable API. Everybody was convinced this was the way to go.

A couple of years later we saw many users of Elasticsearch suffer from that decision, since with a random access API you also need to provide fast random access. Yet with sparse data, which is what many users of Elasticsearch ingest into the system, you end up with huge amounts of space wasted on disk. Obviously the first thing you think about is compression, but to efficiently compress sparse data we needed to rethink our API decision. We moved back to an iterator-based API: since Lucene’s recommended access pattern is document-at-a-time, we can move away from random access and require users to access DocValues in internal document ID order. This let us quickly and significantly improve the situation for sparse data while still allowing fast random access when the data is dense.

This feature took about 5 to 7 years to mature. It teaches us that simple things are never simple and that revisiting a design decision is always worth it.

- Simon Willnauer, Apache Lucene PMC
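To make the trade-off Simon describes a little more concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's actual implementation: the class names (`DenseValues`, `SparseValues`), the `slotCount()` helper, and the sentinel constant are all assumptions made up for illustration. The dense layout answers "what is the value for document N?" at any time but reserves a slot for every document, while the iterator layout stores only the documents that have a value and must be consumed in increasing document ID order.

```java
import java.util.Arrays;

/** Toy model only: none of these types are Lucene classes. */
final class SparseDocValuesSketch {

    /** Hypothetical sentinel returned once the iterator is exhausted. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Random-access layout: one slot per document, even for documents with no value. */
    static final class DenseValues {
        private final long[] slots;

        DenseValues(int maxDoc, int[] docIds, long[] values) {
            slots = new long[maxDoc];
            for (int i = 0; i < docIds.length; i++) {
                slots[docIds[i]] = values[i];
            }
        }

        long get(int docId) { return slots[docId]; }

        int slotCount() { return slots.length; }
    }

    /** Iterator layout: keeps only documents that have a value, sorted by doc ID. */
    static final class SparseValues {
        private final int[] docIds;   // must be sorted ascending
        private final long[] values;  // values[i] belongs to docIds[i]
        private int pos = -1;         // index of the current entry, -1 before the first call

        SparseValues(int[] docIds, long[] values) {
            this.docIds = docIds;
            this.values = values;
        }

        /** Moves to the next document that has a value. */
        int nextDoc() {
            if (pos < docIds.length) {
                pos++;
            }
            return current();
        }

        /** Moves forward to the first document with ID >= target; never moves backwards. */
        int advance(int target) {
            int from = Math.min(pos + 1, docIds.length);
            int idx = Arrays.binarySearch(docIds, from, docIds.length, target);
            pos = idx >= 0 ? idx : -idx - 1;
            return current();
        }

        /** Value of the current document; only valid while positioned on a document. */
        long longValue() { return values[pos]; }

        int slotCount() { return docIds.length; }

        private int current() {
            return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
        }
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        int[] docIds = {3, 70_000, 999_999};
        long[] values = {10L, 20L, 30L};

        DenseValues dense = new DenseValues(maxDoc, docIds, values);
        SparseValues sparse = new SparseValues(docIds, values);
        System.out.println("dense slots:  " + dense.slotCount());
        System.out.println("sparse slots: " + sparse.slotCount());

        // Document-at-a-time consumption in increasing doc ID order.
        long sum = 0;
        for (int doc = sparse.nextDoc(); doc != NO_MORE_DOCS; doc = sparse.nextDoc()) {
            sum += sparse.longValue();
        }
        System.out.println("sum of values: " + sum);
    }
}
```

In this sketch, three values spread across a million documents cost a million slots in the dense layout and three in the iterator layout. The price is that a caller can only move forward, which fits the document-at-a-time access pattern mentioned above.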

Everyone has a favorite feature (or maybe a few). Let's hear from the community...

Visit our Apache Lucene timeline to see how it became the search engine that powers digital experiences worldwide.