26 May 2015

A Dive into the Elasticsearch Storage

By Njal Karevoll

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

In this article we'll investigate the files written to the data directory by various parts of Elasticsearch. We will look at node, index and shard level files and give a short explanation of their contents in order to establish an understanding of the data written to disk by Elasticsearch.

Elasticsearch Paths

Elasticsearch is configured with several paths:

  • path.home: Home directory for the Elasticsearch process. Defaults to the Java system property user.dir, which is the working directory the process was started from.
  • path.conf: A directory containing the configuration files. This is usually set by setting the Java system property es.config, as it naturally has to be resolved before the configuration file is found.
  • path.plugins: A directory whose sub-folders are Elasticsearch plugins. Sym-links are supported here, which can be used to selectively enable/disable a set of plugins for a certain Elasticsearch instance when multiple Elasticsearch instances are run from the same executable.
  • path.work: A directory that was used to store working/temporary files for Elasticsearch. It’s no longer used.
  • path.logs: Where the generated logs are stored. It might make sense to have this on a separate volume from the data directory in case one of the volumes runs out of disk space.
  • path.data: Path to a folder containing the data stored by Elasticsearch.
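
These paths can be set in elasticsearch.yml or passed as Java system properties (prefixed with es.) when starting the process. As a minimal sketch, using made-up mount points purely for illustration, a configuration that separates data and logs onto different volumes could look like this:

```shell
# Hypothetical elasticsearch.yml fragment -- the directory locations are
# examples for illustration, not recommendations:
cat > /tmp/elasticsearch.yml <<'EOF'
path.data: /mnt/ssd/elasticsearch/data
path.logs: /var/log/elasticsearch
EOF
grep -c '^path\.' /tmp/elasticsearch.yml    # two path settings defined
```

Keeping path.logs on a separate volume, as suggested above, means a full data disk doesn't also silence your logs.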

In this article, we’ll have a closer look at the actual contents of the data directory (path.data) and try to gain an understanding of what all the files are used for.

Where Do the Files Come from?

Since Elasticsearch uses Lucene under the hood to handle the indexing and querying on the shard level, the files in the data directory are written by both Elasticsearch and Lucene.

The responsibility of each is quite clear: Lucene is responsible for writing and maintaining the Lucene index files, while Elasticsearch writes metadata related to features on top of Lucene, such as field mappings, index settings and other cluster metadata – end-user and supporting features that do not exist in low-level Lucene but are provided by Elasticsearch.

Let’s look at the outer levels of data written by Elasticsearch before we dive deeper and eventually find the Lucene index files.

Node Data

Simply starting Elasticsearch from an empty data directory yields the following directory tree:

$ tree data
data
└── elasticsearch
    └── nodes
        └── 0
            ├── _state
            │   └── global-0.st
            └── node.lock

The node.lock file is there to ensure that only one Elasticsearch process is reading from and writing to a single data directory at a time.

More interesting is the global-0.st file. The global- prefix indicates that this is a global state file, while the .st extension indicates that it is a state file containing metadata. As you might have guessed, this binary file contains global metadata about your cluster, and the number after the prefix indicates the cluster metadata version, a strictly increasing versioning scheme that follows your cluster.

While it is technically possible to edit these files with a hex editor in an emergency, it is strongly discouraged because it can quickly lead to data loss.

Index Data

Let’s create a single shard index and look at the files changed by Elasticsearch:

$ curl localhost:9200/foo -XPOST -d '{"settings":{"index.number_of_shards": 1}}'
{"acknowledged":true}

$ tree -h data
data
└── [ 102]  elasticsearch
    └── [ 102]  nodes
        └── [ 170]  0
            ├── [ 102]  _state
            │   └── [ 109]  global-0.st
            ├── [ 102]  indices
            │   └── [ 136]  foo
            │       ├── [ 170]  0
            │       │   ├── .....
            │       └── [ 102]  _state
            │           └── [ 256]  state-0.st
            └── [   0]  node.lock

We see that a new directory has been created corresponding to the index name. This directory has two sub-folders: _state and 0. The former contains what's called an index state file (indices/{index-name}/_state/state-{version}.st), which holds metadata about the index, such as its creation timestamp, along with a unique identifier and the settings and mappings for the index. The latter contains data relevant for the first (and only) shard of the index (shard 0), which we'll look at more closely next.
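
To make the naming concrete, here is a small sketch that expands that state-file path template using the values from the listing above (cluster elasticsearch, node 0, index foo, state version 0):

```shell
# Fill in the path template {path.data}/{cluster}/nodes/{node}/indices/{index}/_state/state-{version}.st
# with the values from this example cluster:
CLUSTER=elasticsearch
NODE=0
INDEX=foo
VERSION=0
echo "data/${CLUSTER}/nodes/${NODE}/indices/${INDEX}/_state/state-${VERSION}.st"
```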

Shard Data

The shard data directory contains a state file for the shard that includes versioning as well as information about whether the shard is considered a primary shard or a replica.

$ tree -h data/elasticsearch/nodes/0/indices/foo/0
data/elasticsearch/nodes/0/indices/foo/0
├── [ 102]  _state
│   └── [  81]  state-0.st
├── [ 170]  index
│   ├── [  36]  segments.gen
│   ├── [  79]  segments_1
│   └── [   0]  write.lock
└── [ 102]  translog
    └── [  17]  translog-1429697028120

In earlier Elasticsearch versions, separate {shard_id}/index/_checksums- files (and .cks-files) were also found in the shard data directory. In current versions, these checksums are instead found in the footers of the Lucene files, as Lucene has added end-to-end checksumming for all its index files.

The {shard_id}/index directory contains files owned by Lucene. Elasticsearch generally does not write directly to this folder (except for the older checksum implementation found in earlier versions). The files in these directories constitute the bulk of the size of any Elasticsearch data directory.

Before we enter the world of Lucene, we’ll have a look at the Elasticsearch transaction log, which is unsurprisingly found in the per-shard translog directory with the prefix translog-. The transaction log is very important for the functionality and performance of Elasticsearch, so we’ll explain its use a bit closer in the next section.

Per-Shard Transaction Log

The Elasticsearch transaction log makes sure that data can safely be indexed into Elasticsearch without having to perform a low-level Lucene commit for every document. Committing a Lucene index creates a new segment on the Lucene level which is fsync()-ed and results in a significant amount of disk I/O which affects performance.

In order to accept a document for indexing and make it searchable without requiring a full Lucene commit, Elasticsearch adds it to the Lucene IndexWriter and appends it to the transaction log. After each refresh_interval it will call reopen() on the Lucene indexes, which will make the data searchable without requiring a commit. This is part of the Lucene Near Real Time API. When the IndexWriter eventually commits due to either an automatic flush of the transaction log or due to an explicit flush operation, the previous transaction log is discarded and a new one takes its place.

Should recovery be required, the segments written to disk in Lucene will be recovered first, then the transaction log will be replayed in order to prevent the loss of operations not yet fully committed to disk.
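
The interplay above can be sketched with a toy write-ahead log. This illustrates only the principle; the real translog is a binary format with its own framing:

```shell
# Toy write-ahead log: append every operation to the log before a commit,
# then replay whatever the log contains after a simulated crash.
LOG=/tmp/translog-demo
: > "$LOG"                                    # a fresh translog "generation"
echo 'index {"_id":"1","user":"alice"}' >> "$LOG"
echo 'index {"_id":"2","user":"bob"}'   >> "$LOG"
# ...process dies here, before the next Lucene commit...
# On recovery, the committed Lucene segments are loaded first,
# then the log is replayed:
while IFS= read -r op; do
  echo "replaying: $op"
done < "$LOG"
```

A successful commit would then discard this log and start a new one, exactly as described above.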

Lucene Index Files

Lucene has done a good job of documenting the files in the Lucene index directory, and that overview is reproduced here for your convenience. (The Lucene documentation also goes into detail about the changes these files have gone through all the way back to Lucene 2.1, so check it out.)

Name                    Extension     Brief Description
Segments File           segments_N    Stores information about a commit point
Lock File               write.lock    The write lock prevents multiple IndexWriters from writing to the same file
Segment Info            .si           Stores metadata about a segment
Compound File           .cfs, .cfe    An optional “virtual” file consisting of all the other index files, for systems that frequently run out of file handles
Fields                  .fnm          Stores information about the fields
Field Index             .fdx          Contains pointers to field data
Field Data              .fdt          The stored fields for documents
Term Dictionary         .tim          The term dictionary, stores term info
Term Index              .tip          The index into the term dictionary
Frequencies             .doc          Contains the list of docs which contain each term, along with frequency
Positions               .pos          Stores position information about where a term occurs in the index
Payloads                .pay          Stores additional per-position metadata, such as character offsets and user payloads
Norms                   .nvd, .nvm    Encodes length and boost factors for docs and fields
Per-Document Values     .dvd, .dvm    Encodes additional scoring factors or other per-document information
Term Vector Index       .tvx          Stores offsets into the document data file
Term Vector Documents   .tvd          Contains information about each document that has term vectors
Term Vector Fields      .tvf          The field-level info about term vectors
Live Documents          .liv          Info about which documents are live

Often, you’ll also see a segments.gen file in the Lucene index directory, which is a helper file that contains information about the current/latest segments_N file and is used for filesystems that might not return enough information via directory listings to determine the latest generation segments file.

In older Lucene versions you’ll also find files with the .del suffix. These serve the same purpose as the Live Documents (.liv) files – in other words, these are the deletion lists. If you’re wondering what all this talk about Live Documents and deletion lists are about, you might want to read up on it in the section about building indexes in our Elasticsearch from the bottom-up article.
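
With the extension table in hand, a quick way to see which file types dominate a shard's size is to group the files in its index directory by extension. Here's a sketch against a fabricated directory; the file names below are invented for illustration, so point the same pipeline at a real {shard_id}/index directory instead:

```shell
# Fabricated shard index directory -- made-up file names for illustration only.
mkdir -p /tmp/shard/index
touch /tmp/shard/index/_0.si  /tmp/shard/index/_0.fdt /tmp/shard/index/_0.fdx \
      /tmp/shard/index/_1.si  /tmp/shard/index/segments_2
# Count files per extension; swap `ls` for `du -a` to sum bytes instead.
ls /tmp/shard/index | sed 's/.*\.//' | sort | uniq -c | sort -rn
```

Cross-referencing the counts (or sizes) against the table above tells you whether, say, stored fields (.fdt) or the term dictionary (.tim) is eating your disk.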

Fixing Problematic Shards

Since an Elasticsearch shard contains a Lucene Index, we can use Lucene’s wonderful CheckIndex tool, which enables us to scan and fix problematic segments with usually minimal data loss. We would generally recommend Elasticsearch users to simply re-index the data, but if for some reason that’s not possible and the data is very important, it’s a route that’s possible to take, even if it requires quite a bit of manual work and time, depending on the number of shards and their sizes.

The Lucene CheckIndex tool is included in the default Elasticsearch distribution and requires no additional downloads.

# change this to reflect your shard path, the format is
# {path.data}/{cluster_name}/nodes/{node_id}/indices/{index_name}/{shard_id}/index/

$ export SHARD_PATH=data/elasticsearch/nodes/0/indices/foo/0/index/
$ java -cp lib/elasticsearch-*.jar:lib/*:lib/sigar/* -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex $SHARD_PATH

If CheckIndex detects a problem and its suggestion to fix it looks sensible, you can tell CheckIndex to apply the fix(es) by adding the -fix command line parameter.

Storing Snapshots

You might wonder how all these files translate into the storage used by snapshot repositories. Wonder no more: taking this cluster, snapshotting it as my-snapshot to a filesystem-based repository, and inspecting the files in the repository, we find the following (some files omitted for brevity):

$ tree -h snapshots
snapshots
├── [  31]  index
├── [ 102]  indices
│   └── [ 136]  foo
│       ├── [1.2K]  0
│       │   ├── [ 350]  __0
│       │   ├── [1.8K]  __1
...
│       │   ├── [ 350]  __w
│       │   ├── [ 380]  __x
│       │   └── [8.2K]  snapshot-my-snapshot
│       └── [ 249]  snapshot-my-snapshot
├── [  79]  metadata-my-snapshot
└── [ 171]  snapshot-my-snapshot

At the root we have an index file that contains information about all the snapshots in this repository, and each snapshot has an associated snapshot- and metadata- file. The snapshot- file at the root contains information about the state of the snapshot, which indices it contains, and so on. The metadata- file at the root contains the cluster metadata at the time of the snapshot.

When compress: true is set, the metadata- and snapshot- files are compressed using LZF, which focuses on compression and decompression speed, making it a great fit for Elasticsearch. The data is stored with a header: ZV + 1 byte indicating whether the data is compressed. After the header there will be one or more compressed 64K blocks in the format: 2 byte block length + 2 byte uncompressed size + compressed data. Using this information you can use any LibLZF-compatible decompressor. If you want to learn more about LZF, check out this great description of the format.
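
As a small sketch of that header layout, we can fabricate the first three bytes of such a file and read the magic and the flag byte back. The file and its contents here are made up for illustration; a real snapshot file would continue with the framed 64K blocks described above:

```shell
# Fabricate just the start of a compressed metadata- file:
# the 'ZV' magic followed by a flag byte (here 0x01, i.e. compressed).
printf 'ZV\001' > /tmp/metadata-demo
head -c 2 /tmp/metadata-demo && echo                  # prints the magic: ZV
od -An -tu1 -j2 -N1 /tmp/metadata-demo | tr -d ' '    # the flag byte
```

A decompressor would read this header first, then loop over the per-block length fields to recover each 64K chunk.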

At the index level there is another file, indices/{index_name}/snapshot-{snapshot_name} that contains the index metadata, such as settings and mappings for the index at the time of the snapshot.

At the shard level you'll find two kinds of files: renamed Lucene index files and the shard snapshot file, indices/{index_name}/{shard_id}/snapshot-{snapshot_name}. This file contains information about which of the files in the shard directory are used in the snapshot, and a mapping from the logical file names in the snapshot to the concrete file names they should be stored as on disk when restored. It also contains the checksum, Lucene versioning and size information for all relevant files, which can be used to detect and prevent data corruption.

You might wonder why these files have been renamed instead of just keeping their original file names, which potentially would have been easier to work with directly on disk. The reason is simple: it’s possible to snapshot an index, delete and re-create it before snapshotting it again. In this case, several files would end up having the same names, but different contents.

Summary

In this article we have looked at the files written to the data directory by various levels of Elasticsearch: the node, index and shard level. We’ve seen where the Lucene indexes are stored on disk, and briefly described how to use the Lucene CheckIndex tool to verify and fix problematic shards.

Hopefully, you won't ever need to perform any operations on the contents of the Elasticsearch data directory, but having some insight into what kind of data is written to your file system by your favorite search-based database is always a good idea.