30 janvier 2015

Lucene's Handling of Deleted Documents

When a document is deleted or updated (= delete + add), Apache Lucene simply marks a bit in a per-segment bitset to record that the document is deleted. All subsequent searches simply skip any deleted documents.

It is not until segments are merged that the bytes consumed by deleted documents are reclaimed. Likewise, any terms that occur only in deleted documents (ghost terms) are not removed until merge. This approach is necessary because it would otherwise be far too costly to update Lucene's write-once index data structures and aggregate statistics for every document deletion, but it has some implications:

Deleted documents tie up disk space in the index.
In-memory per-document data structures, such as norms or field data, will still consume RAM for deleted documents.
Search throughput is lower, since each search must check the deleted bitset for every potential hit. More on this below.
Aggregate term statistics, used for query scoring, will still reflect deleted terms and documents. When a merge completes, the term statistics will suddenly jump closer to their true values, changing hit scores. In practice this impact is minor, unless the deleted documents had divergent statistics from the rest of the index.
A deleted document ties up a document ID from the maximum 2.1 B documents for a single shard. If your shard is riding close to that limit (not recommended!) this could matter.
Fuzzy queries can have slightly different results, because they may match ghost terms.

Merging Reclaims Deleted Documents

Lucene's default merge policy, TieredMergePolicy, already prefers merges that would reclaim more deleted documents, other factors being equal. Over time this means segments with more deletions will be targeted for merging. While it does have a tunable setting (index.merge.policy.reclaim_deletes_weight) to control how aggressively it targets deletions, it is dangerous to increase this too much otherwise it could select poor (costly) merge choices, dwarfing any gains from slightly fewer deleted documents.

I was curious how effective its defaults are in practice, so I ran a simple worst-case indexing test. First, I built an initial index with 100 M added documents (no deletions) derived from Wikipedia's English export. Then I updated that index by forever randomly replacing an existing document (never adding a new document), so that every add also incurs a deletion.

There was no pattern to the updates, such as favoring replacing older or newer documents. This is unrealistic, but it is a good worst case test because the deletes accumulate uniformly, in proportion to each segment's size. In real usage, certain segments (old or new) would accumulate deletions at a faster rate and thus be more quickly selected for merging.

I measured the percentage of deleted (but not yet merged away) documents over time, computed as maxDoc/numDocs - 1.0 (where numDocs is constant at 100 M in my test). The graph below shows an initial startup transient, when the percentage quickly rise from 0% to 45% at which point a couple of large merges complete and bring it back down. After that the deletions percentage hovers between 35% and 60%, with a sawtooth shape showing a sudden drop whenever varying sized merges finish. It looks somewhat like the stock market!

A maximum sized segment (default: 5 GB) will only be eligible for merging once it accumulates 50% deletions. If this is too slow for your usage, try decreasing that maximum (index.merge.policy.max_merged_segment): this will result in a somewhat larger segment count, but the reclaiming should happen more quickly, especially when there is a pattern to the deletions.

How Do Deleted Documents Affect Search Performance?

Because deleted documents remain in the index, they must still be decoded from the postings lists and then skipped during searching, so there is added search cost. To test how much, I ran a search performance test for varying queries using the 100 M document index with no deletions as the baseline, and the same index with 50% deleted documents (i.e., 150 M documents with 50M deleted). Both indices were single-segment. Here are the results:

Query	QPS	StdDev	QPS with deletes	StdDev with deletes	% change
Int Range query	1.2	(5.1%)	0.6	(1.8%)	46%
Prefix query	5.7	(5.0%)	3.4	(2.3%)	41%
Wildcard	5.3	(4.4%)	3.2	(2.2%)	39%
And High+Low	91.1	(2.0%)	59.5	(2.1%)	34%
Med Phrase	36.2	(2.8%)	24.4	(1.3%)	32%
And High+Med	16.6	(1.5%)	11.2	(1.0%)	32%
Med Term	12.6	(2.6%)	8.6	(6.1%)	31%
And High+High	4.4	(1.3%)	3.0	(0.9%)	31%
High Term	6.1	(2.8%)	4.2	(6.1%)	31%
Fuzzy1	33.5	(12.7%)	23.6	(8.1%)	29%
Low Term	61.1	(6.3%)	43.6	(7.1%)	28%
Med Sloppy Phrase	7.3	(4.4%)	5.2	(1.7%)	28%
Fuzzy2	33.7	(13.3%)	24.2	(8.5%)	28%
Or High+Med	6.8	(5.4%)	4.9	(4.5%)	27%
Or High+Low	5.7	(5.6%)	4.1	(4.7%)	27%
Low Phrase	8.3	(2.9%)	6.0	(1.6%)	27%
Or High+High	1.5	(5.5%)	1.1	(4.7%)	26%
High Phrase	2.1	(5.1%)	1.5	(2.8%)	25%
Med Span Near	15.8	(9.3%)	11.8	(3.8%)	25%
Low Sloppy Phrase	2.7	(3.2%)	2.0	(1.9%)	25%
Low Span Near	3.9	(4.8%)	3.2	(2.7%)	18%
High Sloppy Phrase	2.8	(5.9%)	2.3	(4.6%)	18%
High Span Near	2.4	(4.4%)	2.0	(2.5%)	18%

The bad news is there is clearly a non-trivial performance cost to deleted documents, and this is something we can work to reduce over time (patches welcome!). The good news is the cost is typically quite a bit lower than the percentage deletes (50% in this test) because these documents are filtered out at a low level before any of the costly query matchers and scorers see them. The more costly queries (Phrase, Span) tend to see the lowest impact, which is also good because it is the slow queries that determine node capacity for most applications.

How About Expunge Deletes?

Elasticsearch's optimize API accepts an only_expunge_deletes flag, which in turn calls Lucene's IndexWriter.expungeDeletes method. While this will forcefully reclaim space from deleted documents, this operation is very costly: under the hood, it forces merging of any segments that have more than 10% (by default) deletions. Use it sparingly: it is better to let Lucene's natural merging handle reclaiming deletions.

However, if you have an index which receives only deletions (never an added or updated document) then beware that Lucene in this case currently fails to kick off any merges. This is a known issue that has been fixed, and will be fixed in Lucene 5.0 and Elasticsearch 2.0. In the meantime, this is an appropriate time to periodically expunge deletes!

Time-Based Indices

Editor's Note (June 12, 2018): As you read, please note that the content in this section is relevant for Elasticsearch 2.4. With Elasticsearch 5.x the ttl field was removed.

Elasticsearch lets you specify time-to-live for each added document, which means after that time has passed, the document is automatically deleted. This is very useful for certain applications, but it will cause heavy deletions over time.

One simple optimization Lucene uses, that may help in such use cases, is to simply drop a segment once it has accumulated 100% deleted documents, without waiting for it to be merged away. The optimization is somewhat fragile since it only applies when all documents in the segment were deleted, but it is very effective since it is obviously extremely fast and happens before merging. Unfortunately, because TieredMergePolicy picks out of order merges, it reduces how frequently the optimization can apply in time-to-live indices.

If you need to further improve indexing performance with time-to-live documents consider using time-based indices instead, such as one index per day or per week: dropping an entire index is quite a bit more efficient than having Lucene remove a subset of documents. If you are concerned about the loss of granularity with this approach, just add a filter to the request to remove the oldest results from the oldest index.

If you are curious about how many deleted documents are in your shards, use the indices segments API to find out. Just don't read too much into it. Overall, besides perhaps decreasing the maximum segment size, it is best to leave Lucene's defaults as-is and not fret too much about when deletes are reclaimed.

Ingénierie du contexte

Base vectorielle

Applications optimisées pour la recherche

Logs

Protection contre les menaces

Workflows

Elasticsearch

Kibana (Discover, tableaux de bord)

Elastic Agent Builder

AutoOps

Langage de requête canalisé

Modèles de recherche Jina AI

Elastic Cloud Serverless

Elastic Cloud hébergé

Elasticsearch autogéré

Recherche sur les sites d'e-commerce

Recherche dans le service client

Applications axées sur la recherche

Analyse des logs

Suivi d'infrastructure

Suivi de l'expérience numérique

App : suivi des performances

AIOps

Observabilité des LLM

SIEM nouvelle génération

Workflows pour la sécurité

XDR et sécurité aux points de terminaison

L'IA pour la sécurité

Décuplez la valeur de vos données

Fournisseurs cloud

Écosystème IA d'Elastic

Programme de partenariat Search AI

AV-Comparatives

Forrester Wave™ XDR

Leader dans le Magic Quadrant de Gartner

IDC MarketScape

Recherche

Security

Observability

Lancez-vous

Galerie de démonstrations

Téléchargements

Intégrations

Documentation

Elasticsearch Labs

Elastic Security Labs

Elastic Observability Labs

Blog

Communauté

Événements

Webinars

Discussion

Formation

Support technique

Conseil

Lucene's Handling of Deleted Documents

Merging Reclaims Deleted Documents

How Do Deleted Documents Affect Search Performance?

How About Expunge Deletes?

Time-Based Indices