WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Denormalization and Concurrency
Of course, data denormalization has downsides too. The first disadvantage is that the index will be bigger because the _source document for every blog post is bigger, and there are more indexed fields. This usually isn't a huge problem. The data written to disk is highly compressed, and disk space is cheap. Elasticsearch can happily cope with the extra data.
The more important issue is that, if the user were to change his name, all of his blog posts would need to be updated too. Fortunately, users don't often change names. Even if they did, it is unlikely that a user would have written more than a few thousand blog posts, so updating blog posts with the scroll and bulk APIs would take less than a second.
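For instance, here is a minimal sketch of that process, assuming the blogpost documents from the previous section, where each post in my_index carries a denormalized user.id and user.name (the IDs and the new name below are purely illustrative). First, retrieve the affected posts with a scrolled search:

GET /my_index/blogpost/_search?scroll=1m
{
  "query": {
    "term": {
      "user.id": 1
    }
  }
}

Then send a partial-document update for each hit with the bulk API:

POST /_bulk
{ "update": { "_index": "my_index", "_type": "blogpost", "_id": 2 }}
{ "doc": { "user": { "name": "John Smithers" }}}

The doc merge is recursive, so user.id is left untouched. The scroll is continued with the _scroll_id returned by each response until no more hits remain.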
However, let’s consider a more complex scenario in which changes are common, far reaching, and, most important, concurrent.
In this example, we are going to emulate a filesystem with directory trees in Elasticsearch, much like a filesystem on Linux: the root of the directory is /, and each directory can contain files and subdirectories.
We want to be able to search for files that live in a particular directory, the equivalent of this:
grep "some text" /clinton/projects/elasticsearch/*
This requires us to index the path of the directory where the file lives:
PUT /fs/file/1
{
  "name":     "README.txt",
  "path":     "/clinton/projects/elasticsearch",
  "contents": "Starting a new Elasticsearch project is easy..."
}
Really, we should also index directory documents so we can list all files and subdirectories within a directory, but for brevity's sake, we will ignore that requirement.
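Purely for illustration, such a directory document might look something like this (the directory type is hypothetical and is not used again in this example):

PUT /fs/directory/1
{
  "name": "elasticsearch",
  "path": "/clinton/projects"
}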
We also want to be able to search for files that live anywhere in the directory tree below a particular directory, the equivalent of this:
grep -r "some text" /clinton
To support this, we need to index the path hierarchy:
- /clinton
- /clinton/projects
- /clinton/projects/elasticsearch
This hierarchy can be generated automatically from the path field, using the path_hierarchy tokenizer:
PUT /fs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}
The custom paths analyzer uses the path_hierarchy tokenizer with its default settings.
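To check what this analyzer produces, we can pass a path through the analyze API (a quick sanity check, assuming the fs index above has been created):

GET /fs/_analyze?analyzer=paths
/clinton/projects/elasticsearch

This should return three tokens, one per level of the hierarchy: /clinton, /clinton/projects, and /clinton/projects/elasticsearch.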
The mapping for the file type would look like this:
PUT /fs/_mapping/file
{
  "properties": {
    "name": {
      "type":  "string",
      "index": "not_analyzed"
    },
    "path": {
      "type":  "string",
      "index": "not_analyzed",
      "fields": {
        "tree": {
          "type":     "string",
          "analyzer": "paths"
        }
      }
    }
  }
}
The name field will contain the exact, unanalyzed file name. The path field will contain the exact, unanalyzed directory path, while the path.tree multifield will also index the path hierarchy, using the custom paths analyzer.
Once the index is set up and the files have been indexed, we can perform a search for files containing elasticsearch in just the /clinton/projects/elasticsearch directory, like this:
GET /fs/file/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "contents": "elasticsearch"
        }
      },
      "filter": {
        "term": {
          "path": "/clinton/projects/elasticsearch"
        }
      }
    }
  }
}
Every file that lives in any subdirectory under /clinton will include the term /clinton in the path.tree field. So we can search for all files in any subdirectory of /clinton as follows:
GET /fs/file/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "contents": "elasticsearch"
        }
      },
      "filter": {
        "term": {
          "path.tree": "/clinton"
        }
      }
    }
  }
}
Renaming Files and Directories
So far, so good. Renaming a file is easy: a simple update or index request is all that is required. You can even use optimistic concurrency control to ensure that your change doesn't conflict with the changes from another user:
PUT /fs/file/1?version=2
{
  "name":     "README.asciidoc",
  "path":     "/clinton/projects/elasticsearch",
  "contents": "Starting a new Elasticsearch project is easy..."
}
The version number ensures that the change is applied only if the document in the index has this same version number. If another change has bumped the version in the meantime, the request fails with a 409 Conflict, and we can re-fetch the document and retry.
We can even rename a directory, but this means updating all of the files that exist anywhere in the path hierarchy beneath that directory. This may be quick or slow, depending on how many files need to be updated. All we would need to do is use scroll to retrieve all the files, and the bulk API to update them. The process isn't atomic, but all files will quickly move to their new home.
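Here is a rough sketch of that process, much like the blog-post example earlier, except that we reindex whole documents because the path prefix itself must change. Assume we are renaming /clinton/projects to /clinton/work (the names are illustrative). First, retrieve every file below the old directory with a scrolled search on path.tree:

GET /fs/file/_search?scroll=1m
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "path.tree": "/clinton/projects"
        }
      }
    }
  }
}

Then reindex each hit with its path prefix rewritten, using the bulk API:

POST /_bulk
{ "index": { "_index": "fs", "_type": "file", "_id": 1 }}
{ "name": "README.asciidoc", "path": "/clinton/work/elasticsearch", "contents": "Starting a new Elasticsearch project is easy..." }

Each batch of results is rewritten and reindexed, and the scroll is continued with the returned _scroll_id until no hits remain. The path.tree field is regenerated automatically at index time, so the renamed files immediately match searches under /clinton/work.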