30 janvier 2015 Technique

Lucene's Handling of Deleted Documents

Par Michael McCandless

When a document is deleted or updated (= delete + add), Apache Lucene simply marks a bit in a per-segment bitset to record that the document is deleted. All subsequent searches simply skip any deleted documents.

It is not until segments are merged that the bytes consumed by deleted documents are reclaimed. Likewise, any terms that occur only in deleted documents (ghost terms) are not removed until merge. This approach is necessary because it would otherwise be far too costly to update Lucene's write-once index data structures and aggregate statistics for every document deletion, but it has some implications:

  • Deleted documents tie up disk space in the index.
  • In-memory per-document data structures, such as norms or field data, will still consume RAM for deleted documents.
  • Search throughput is lower, since each search must check the deleted bitset for every potential hit. More on this below.
  • Aggregate term statistics, used for query scoring, will still reflect deleted terms and documents. When a merge completes, the term statistics will suddenly jump closer to their true values, changing hit scores. In practice this impact is minor, unless the deleted documents had divergent statistics from the rest of the index.
  • A deleted document ties up a document ID from the maximum 2.1 B documents for a single shard. If your shard is riding close to that limit (not recommended!) this could matter.
  • Fuzzy queries can have slightly different results, because they may match ghost terms.

Merging Reclaims Deleted Documents

Lucene's default merge policy, TieredMergePolicy, already prefers merges that would reclaim more deleted documents, other factors being equal. Over time this means segments with more deletions will be targeted for merging. While it does have a tunable setting (index.merge.policy.reclaim_deletes_weight) to control how aggressively it targets deletions, it is dangerous to increase this too much otherwise it could select poor (costly) merge choices, dwarfing any gains from slightly fewer deleted documents.

I was curious how effective its defaults are in practice, so I ran a simple worst-case indexing test. First, I built an initial index with 100 M added documents (no deletions) derived from Wikipedia's English export. Then I updated that index by forever randomly replacing an existing document (never adding a new document), so that every add also incurs a deletion.

There was no pattern to the updates, such as favoring replacing older or newer documents. This is unrealistic, but it is a good worst case test because the deletes accumulate uniformly, in proportion to each segment's size. In real usage, certain segments (old or new) would accumulate deletions at a faster rate and thus be more quickly selected for merging.

I measured the percentage of deleted (but not yet merged away) documents over time, computed as maxDoc/numDocs - 1.0 (where numDocs is constant at 100 M in my test). The graph below shows an initial startup transient, when the percentage quickly rise from 0% to 45% at which point a couple of large merges complete and bring it back down. After that the deletions percentage hovers between 35% and 60%, with a sawtooth shape showing a sudden drop whenever varying sized merges finish. It looks somewhat like the stock market!

A maximum sized segment (default: 5 GB) will only be eligible for merging once it accumulates 50% deletions. If this is too slow for your usage, try decreasing that maximum (index.merge.policy.max_merged_segment): this will result in a somewhat larger segment count, but the reclaiming should happen more quickly, especially when there is a pattern to the deletions.

How Do Deleted Documents Affect Search Performance?

Because deleted documents remain in the index, they must still be decoded from the postings lists and then skipped during searching, so there is added search cost. To test how much, I ran a search performance test for varying queries using the 100 M document index with no deletions as the baseline, and the same index with 50% deleted documents (i.e., 150 M documents with 50M deleted). Both indices were single-segment. Here are the results:

Query QPS     StdDev     QPS with deletes     StdDev with deletes     % change
Int Range query 1.2 (5.1%) 0.6 (1.8%) 46%
Prefix query 5.7 (5.0%) 3.4 (2.3%) 41%
Wildcard 5.3 (4.4%) 3.2 (2.2%) 39%
And High+Low 91.1 (2.0%) 59.5 (2.1%) 34%
Med Phrase 36.2 (2.8%) 24.4 (1.3%) 32%
And High+Med 16.6 (1.5%) 11.2 (1.0%) 32%
Med Term 12.6 (2.6%) 8.6 (6.1%) 31%
And High+High 4.4 (1.3%) 3.0 (0.9%) 31%
High Term 6.1 (2.8%) 4.2 (6.1%) 31%
Fuzzy1 33.5 (12.7%) 23.6 (8.1%) 29%
Low Term 61.1 (6.3%) 43.6 (7.1%) 28%
Med Sloppy Phrase 7.3 (4.4%) 5.2 (1.7%) 28%
Fuzzy2 33.7 (13.3%) 24.2 (8.5%) 28%
Or High+Med 6.8 (5.4%) 4.9 (4.5%) 27%
Or High+Low 5.7 (5.6%) 4.1 (4.7%) 27%
Low Phrase 8.3 (2.9%) 6.0 (1.6%) 27%
Or High+High 1.5 (5.5%) 1.1 (4.7%) 26%
High Phrase 2.1 (5.1%) 1.5 (2.8%) 25%
Med Span Near 15.8 (9.3%) 11.8 (3.8%) 25%
Low Sloppy Phrase 2.7 (3.2%) 2.0 (1.9%) 25%
Low Span Near 3.9 (4.8%) 3.2 (2.7%) 18%
High Sloppy Phrase     2.8 (5.9%) 2.3 (4.6%) 18%
High Span Near 2.4 (4.4%) 2.0 (2.5%) 18%

The bad news is there is clearly a non-trivial performance cost to deleted documents, and this is something we can work to reduce over time (patches welcome!). The good news is the cost is typically quite a bit lower than the percentage deletes (50% in this test) because these documents are filtered out at a low level before any of the costly query matchers and scorers see them. The more costly queries (Phrase, Span) tend to see the lowest impact, which is also good because it is the slow queries that determine node capacity for most applications.

How About Expunge Deletes?

Elasticsearch's optimize API accepts an only_expunge_deletes flag, which in turn calls Lucene's IndexWriter.expungeDeletes method. While this will forcefully reclaim space from deleted documents, this operation is very costly: under the hood, it forces merging of any segments that have more than 10% (by default) deletions. Use it sparingly: it is better to let Lucene's natural merging handle reclaiming deletions.

However, if you have an index which receives only deletions (never an added or updated document) then beware that Lucene in this case currently fails to kick off any merges. This is a known issue that has been fixed, and will be fixed in Lucene 5.0 and Elasticsearch 2.0. In the meantime, this is an appropriate time to periodically expunge deletes!

Time-Based Indices

Elasticsearch lets you specify time-to-live for each added document, which means after that time has passed, the document is automatically deleted. This is very useful for certain applications, but it will cause heavy deletions over time.

One simple optimization Lucene uses, that may help in such use cases, is to simply drop a segment once it has accumulated 100% deleted documents, without waiting for it to be merged away. The optimization is somewhat fragile since it only applies when all documents in the segment were deleted, but it is very effective since it is obviously extremely fast and happens before merging. Unfortunately, because TieredMergePolicy picks out of order merges, it reduces how frequently the optimization can apply in time-to-live indices.

If you need to further improve indexing performance with time-to-live documents consider using time-based indices instead, such as one index per day or per week: dropping an entire index is quite a bit more efficient than having Lucene remove a subset of documents. If you are concerned about the loss of granularity with this approach, just add a filter to the request to remove the oldest results from the oldest index.

If you are curious about how many deleted documents are in your shards, use the indices segments API to find out. Just don't read too much into it. Overall, besides perhaps decreasing the maximum segment size, it is best to leave Lucene's defaults as-is and not fret too much about when deletes are reclaimed.