April 8, 2015 | How to

What is an Apache Lucene Codec?

You've likely heard that Apache Lucene uses something called a codec to read and write index files. What does that really mean?

A codec is a concrete instance of the abstract org.apache.lucene.codecs.Codec class. Each codec has a unique string name, such as “Lucene410”, and implements methods that return a separate Format class for each part of Lucene's index. The codec's name is registered through Java's Service Provider Interface (SPI), so you can easily get the Codec instance at any time from just its name.
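To make the name-based lookup concrete, here is a small JDK-only sketch. It is a toy model, not Lucene's code: ToyCodec, ToyLucene410Codec, the format strings and the explicit register call are all hypothetical stand-ins. In Lucene itself the lookup is Codec.forName(name), and registration happens through SPI rather than through a map you fill by hand.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for org.apache.lucene.codecs.Codec, for illustration only.
abstract class ToyCodec {
    // Lucene discovers codecs through SPI; this explicit map is a simplification.
    private static final Map<String, ToyCodec> REGISTRY = new ConcurrentHashMap<>();

    private final String name;

    protected ToyCodec(String name) {
        this.name = name;
    }

    final String getName() {
        return name;
    }

    // One accessor per index part; a real codec returns a Format object for each.
    abstract String storedFieldsFormat();

    abstract String postingsFormat();

    static void register(ToyCodec codec) {
        REGISTRY.put(codec.getName(), codec);
    }

    // Resolve a codec purely from its unique name.
    static ToyCodec forName(String name) {
        ToyCodec codec = REGISTRY.get(name);
        if (codec == null) {
            throw new IllegalArgumentException("No codec registered under name '" + name + "'");
        }
        return codec;
    }
}

// A concrete codec with a fixed, unique name. The format names are made up.
final class ToyLucene410Codec extends ToyCodec {
    ToyLucene410Codec() {
        super("Lucene410");
    }

    @Override
    String storedFieldsFormat() {
        return "compressing-stored-fields";
    }

    @Override
    String postingsFormat() {
        return "block-postings";
    }
}
```

After ToyCodec.register(new ToyLucene410Codec()), a call to ToyCodec.forName("Lucene410") returns that instance, while an unknown name fails fast with an IllegalArgumentException.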

Whenever Lucene needs to access the index, it does so entirely through the codec APIs. This is a vital core abstraction: it isolates all the other complex logic of searching and indexing from the low-level details of how data structures are stored on disk and in RAM.

This was a big step forward for Lucene because it presents a much lower barrier to research and innovation in the index file formats than before, when the bit-level encoding details were scattered throughout the code base.

Each format exposed by the codec in turn provides reading APIs, used at search time, and writing APIs, used during indexing. The codec is set per segment and every segment is free to use a different codec, though that's uncommon. A codec defines these nine formats:

StoredFieldsFormat encodes each document's stored fields. The default uses block compression, and there have been many improvements recently, especially around compression, merge and CheckIndex performance.
TermVectorsFormat encodes per-document term vectors. It also uses block compression.
LiveDocsFormat handles the optional bitset marking which documents are not deleted.
NormsFormat encodes per-document and per-indexed-field index-time score normalization factors as encoded by the similarity. It typically contains a compact encoding of the field's length and any field level boosting. Recent improvements include better compression, handling sparse field cases, and not loading all values into heap during merge and CheckIndex.
SegmentInfoFormat saves metadata about the segment, such as its name, how many documents it has, its unique id, etc.
FieldInfosFormat records per-field metadata, including the field's name, whether and how it was indexed, stored, etc.
PostingsFormat covers the inverted index, including all fields, terms, documents, positions, payloads and offsets. Queries use this API to find their matching documents.
DocValuesFormat holds the column-stride per-document values, and the default format is largely disk-based with a small in-memory index for faster random access.
CompoundFormat is used when compound file format is enabled (the default). It collapses separate index files, written by all of the above formats, into two files per segment, to reduce file handles required during searching.

Testing a Codec

When you create a codec you don't have to implement all nine of these formats! Rather, you would typically use pre-existing formats for all parts except the one you are testing, and even for that one, you would start with source code from an existing implementation and tweak from there.
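The "reuse everything except the part under test" approach is essentially a delegation pattern. The sketch below models it with plain Java. MiniCodec, BaselineCodec, DelegatingCodec and the format strings are hypothetical names invented for this example; Lucene ships a FilterCodec class that plays the DelegatingCodec role for real codecs.

```java
// Hypothetical, simplified view of a codec: three of the nine formats, as names only.
interface MiniCodec {
    String name();

    String storedFieldsFormat();

    String postingsFormat();

    String docValuesFormat();
}

// Stand-in for an existing, well-tested codec.
final class BaselineCodec implements MiniCodec {
    @Override
    public String name() {
        return "Baseline";
    }

    @Override
    public String storedFieldsFormat() {
        return "baseline-stored-fields";
    }

    @Override
    public String postingsFormat() {
        return "baseline-postings";
    }

    @Override
    public String docValuesFormat() {
        return "baseline-docvalues";
    }
}

// Forwards every format to the wrapped codec unless a subclass overrides it.
class DelegatingCodec implements MiniCodec {
    private final String name;
    private final MiniCodec delegate;

    DelegatingCodec(String name, MiniCodec delegate) {
        this.name = name;
        this.delegate = delegate;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String storedFieldsFormat() {
        return delegate.storedFieldsFormat();
    }

    @Override
    public String postingsFormat() {
        return delegate.postingsFormat();
    }

    @Override
    public String docValuesFormat() {
        return delegate.docValuesFormat();
    }
}

// Only the postings format is replaced; everything else is inherited unchanged.
final class MyPostingsExperimentCodec extends DelegatingCodec {
    MyPostingsExperimentCodec() {
        super("MyCodec", new BaselineCodec());
    }

    @Override
    public String postingsFormat() {
        return "my-experimental-postings";
    }
}
```

Keeping the experiment to a single overridden method means any test failure points at the one format you changed.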

Often it's information retrieval researchers or Lucene developers who experiment with new codecs, using the pluggable infrastructure to try out new ideas, test performance tradeoffs, etc.

Speaking of testing, by default Lucene's test framework randomly picks a codec and formats from what's available in the classpath. This is a very powerful way to ferret out any bugs in our default and experimental codecs since Lucene's tests are quite stressful and extensive.

You can use this for your own codecs too. Perhaps you think you've fully debugged your shiny new high performance postings format and you're ready to submit a patch to Lucene, but before you do that, try running all Lucene's tests with your new format.

This is easy: run a top-level ant test -Dtests.codec=MyCodec. You'll also have to ensure your codec's class files are on ant's classpath. If you see any exotic test failures that go away when you stop using your codec, roll your sleeves up and start debugging!

Experimental Codecs and Backwards Compatibility

For each Lucene release, the default codec (e.g. Lucene50 for the 5.0.0 release) is the only one that is guaranteed to have backwards compatibility through the next major Lucene version. All other codecs and formats are experimental, meaning their format is free to change in incompatible and even undocumented ways on every release. Not only are they free to change, but they frequently do!

Lucene uses the codec API to implement backwards compatibility, by keeping all codecs for reading (but not writing!) old indices in the backward-codecs module. If you look in that module you'll see a number of codecs to handle reading each of the major format changes that took place during Lucene's 4.x development.

Because experimental formats are inherently unstable, they are not exposed in Elasticsearch, and are instead used by Lucene developers to experiment with new ideas.

One particularly exotic experimental codec is SimpleTextCodec. It encodes absolutely everything in plain text files that you can browse with any text editor. This is a powerful way to learn what Lucene actually stores in each part of the index: just index a few documents and then open your index in any text editor! However, please do not attempt to use this for anything but the most trivial test indices: it is very costly in both space and time, and it uses inefficient yet approachable implementations.

DirectDocValuesFormat and DirectPostingsFormat store doc values and postings entirely in heap, uncompressed, as flat native Java arrays. These formats are exceptionally wasteful of RAM! MemoryDocValuesFormat also stores all doc values in heap, but in a more compressed fashion. There are also formats that store the entire terms dictionary as a finite-state transducer, unlike the default postings format, which only stores the terms index in heap and must seek and scan a term block to find a given term.

Even though the experimental formats have no backwards compatibility, and are thus not suitable for production use, over time their compelling features tend to find their way into the default format. For example, PulsingPostingsFormat, which inlines all postings for single-occurrence terms into the terms dictionary so primary key lookups can save a likely otherwise costly disk seek, is now folded into the default BlockPostingsFormat. Similarly, we used to have dedicated formats for compressing stored fields, and this has now been folded into Lucene50Codec as different compression modes (best compression vs. best speed). Another example is primarily disk-based doc values formats, which used to be an experimental format but are now the default, to reduce heap required for large indices.

One important property of all codecs is that they must encode into the index everything necessary to instantiate themselves at read time. They must be “self-describing,” neither requiring nor accepting any further parameters at read time. This means Lucene simply reads the codec's name out of the segments file, retrieves the codec instance using Java's SPI, and then uses that codec instance to access everything.
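Here is a minimal sketch of that self-describing idea using only the JDK. The byte layout, the SelfDescribingSegment class and the two decoder names ("PlainText" and "Reversed") are invented for illustration and have nothing to do with Lucene's actual file formats. The point is only that the reader takes no parameters: the name stored in the data picks the decoder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical toy: a "segment" that records the name of the codec that wrote it.
final class SelfDescribingSegment {
    // Name -> decoder. In Lucene the name resolves to a Codec instance via SPI.
    private static final Map<String, Function<byte[], String>> DECODERS = new HashMap<>();

    static {
        DECODERS.put("PlainText", bytes -> new String(bytes, StandardCharsets.UTF_8));
        DECODERS.put("Reversed", bytes ->
            new StringBuilder(new String(bytes, StandardCharsets.UTF_8)).reverse().toString());
    }

    private SelfDescribingSegment() {}

    // Writes the codec name first, then the already-encoded payload.
    static byte[] write(String codecName, byte[] encodedPayload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(codecName);
            out.writeInt(encodedPayload.length);
            out.write(encodedPayload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Takes no configuration: the stored name alone selects the decoder.
    static String read(byte[] segment) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(segment))) {
            String codecName = in.readUTF();
            Function<byte[], String> decoder = DECODERS.get(codecName);
            if (decoder == null) {
                throw new IllegalArgumentException("Unknown codec name: " + codecName);
            }
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return decoder.apply(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A segment written with the name "Reversed" is decoded by the reversing decoder, and one written with "PlainText" by the plain decoder, without the caller ever saying which.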

Per-field control for Doc Values and Postings

Because the postings and doc values formats are especially important, and there can be substantial variation across fields, these two formats each have a special per-field format whose purpose is to let separate fields within a single segment have different formats. The default Lucene50Codec uses these per-field formats and has two protected methods that you can override during indexing to decide which field uses which format.
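The override mechanism can be modeled in a few lines of plain Java. This is a sketch under assumptions, not Lucene's implementation: ToyPerFieldCodec, IdTunedCodec, describeSegment and the format strings are hypothetical. The two hook names mirror getPostingsFormatForField and getDocValuesFormatForField, which I believe are the methods the default codec exposes for this purpose.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a codec with per-field hooks. Formats are plain names here.
class ToyPerFieldCodec {
    // Hook: which postings format should write this field?
    protected String getPostingsFormatForField(String field) {
        return "default-postings";
    }

    // Hook: which doc values format should write this field?
    protected String getDocValuesFormatForField(String field) {
        return "default-docvalues";
    }

    // Consults the postings hook once per field, in the order the fields are given.
    final Map<String, String> describeSegment(Iterable<String> fields) {
        Map<String, String> formatByField = new LinkedHashMap<>();
        for (String field : fields) {
            formatByField.put(field, getPostingsFormatForField(field));
        }
        return formatByField;
    }
}

// Sends the primary-key field to a different postings format and leaves the rest alone.
final class IdTunedCodec extends ToyPerFieldCodec {
    @Override
    protected String getPostingsFormatForField(String field) {
        if ("id".equals(field)) {
            return "in-memory-postings";
        }
        return super.getPostingsFormatForField(field);
    }
}
```

With fields "id" and "body", IdTunedCodec maps "id" to "in-memory-postings" and "body" to "default-postings", so two fields in the same segment end up with different formats.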

Furthermore, the write API for these two formats is particularly flexible. Originally these APIs were similar to XML's SAX API, where a single pass was “pushed” down to the format. But this was awkward, especially for formats that needed more than one pass through the data to determine the best encoding. We've since switched to a “pull” API, somewhat analogous to XML's DOM API, so that the format implementation is free to sweep multiple times through the doc values or postings in order to improve its encoding.
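To see why a pull API helps, consider an encoder that needs two passes: one to measure the data and one to encode it. The sketch below is a deliberately simple, hypothetical example (TwoPassLongEncoder is not a Lucene class, and the deltas are kept in a long[] rather than actually bit-packed). Because the encoder receives an Iterable it can iterate twice, which a single "pushed" pass would not allow.

```java
// Hypothetical two-pass encoder over per-document long values.
final class TwoPassLongEncoder {
    // Result of encoding: the minimum, the bits needed per delta, and the deltas.
    static final class Encoded {
        final long min;
        final int bitsPerValue;
        final long[] deltas;

        Encoded(long min, int bitsPerValue, long[] deltas) {
            this.min = min;
            this.bitsPerValue = bitsPerValue;
            this.deltas = deltas;
        }
    }

    private TwoPassLongEncoder() {}

    static Encoded encode(Iterable<Long> values) {
        // Pass 1: pull every value once to find the range.
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        int count = 0;
        for (long value : values) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            count++;
        }
        if (count == 0) {
            throw new IllegalArgumentException("no values to encode");
        }

        // Assumes the range fits in a long; subtractExact throws if it does not.
        long range = Math.subtractExact(max, min);
        // Bits needed to represent the largest delta (0 when all values are equal).
        int bitsPerValue = 64 - Long.numberOfLeadingZeros(range);

        // Pass 2: pull the same values again and store each as an offset from min.
        long[] deltas = new long[count];
        int i = 0;
        for (long value : values) {
            deltas[i++] = value - min;
        }
        return new Encoded(min, bitsPerValue, deltas);
    }

    static long decode(Encoded encoded, int index) {
        return encoded.min + encoded.deltas[index];
    }
}
```

For the values 1000, 1003, 1001 and 1007, the first pass finds a minimum of 1000 and a range of 7, so each delta needs only 3 bits instead of the 10 that the raw values would need.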

Most users will never need to concern themselves with the implementation details under Lucene's codec APIs, because the default codec works very well in general. But if you are an adventurous and innovative developer eager to explore possible improvements, just know that it is easy to create and plug in your own fun modular codec formats!