Elasticsearch as a NoSQL Database

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud. 14-day hosted Elasticsearch no-cost trials are available on Elastic Cloud.

Can Elasticsearch be used as a "NoSQL"-database? NoSQL means different things in different contexts, and interestingly it's not really about SQL. We will start out with a "Maybe!", and look into the various properties of Elasticsearch as well as those it has sacrificed, in order to become one of the most flexible, scalable and performant search and analytics engines yet.

What is a NoSQL Database Anyway?

nosql-database.org defines NoSQL as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable". In other words, it's not a very precise definition.

It's not about SQL in particular. For example, Hive's query language is clearly inspired by SQL. The same is true for Esper's query language, which operates on streams instead of relations. Also, did you know PostgreSQL was named "Postgres" and had "Quel" as its query language back in the day? While first and foremost an ORDBMS, it now also has many features that make it viable as a schemaless document store.

It's not about ACID-ity either. Hyperdex is one example of a NoSQL-database that aims to provide ACID-transactions. MySQL, certainly an SQL-database, has a history of dubious interpretations of what ACID really means.

Relations? While most NoSQL-databases do not support joining in the same sense as traditional relational databases and leave that as an exercise for the user, there are those that do: RethinkDB, Hive and Pig, to name a few. Neo4j, a graph-oriented database, certainly deals with relations - it's excellent at traversing relations (i.e. edges) in graphs. Elasticsearch has a concept of "query time" joining with parent/child-relations and "index time" joining with nested types.

Distributed? While there are some distributed SQL-databases around, and some projects aiming to be something like a NoSQLite, newer generation databases tend to be distributed in some way or another.

In short, it makes sense neither to define NoSQL precisely, nor to simply say that Elasticsearch is a "document store"-type NoSQL-database. At the time of writing, nosql-database.org lists more than 20 of those.

In the next sections, we'll have a look at some important properties and see how Elasticsearch does or does not implement them.

No Transactions

Lucene, which Elasticsearch is built on, has a notion of transactions. Elasticsearch, on the other hand, does not have transactions in the typical sense. There is no way to roll back a submitted document, and you cannot submit a group of documents and have either all or none of them indexed. What it does have, however, is a write-ahead log to ensure the durability of operations without having to do an expensive Lucene commit. You can also specify the consistency level of index operations, in terms of how many replicas must acknowledge the operation before it returns. This defaults to a quorum, i.e. \(\lfloor\frac{n}{2}\rfloor + 1\).
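To make the quorum formula concrete, here is a minimal sketch in Python. The function name quorum is our own, and we assume that n counts the shard copies that take part in acknowledging a write. It only illustrates the arithmetic, not how Elasticsearch implements the check.

```python
def quorum(copies: int) -> int:
    """Smallest majority of `copies` shard copies: floor(n / 2) + 1."""
    if copies < 1:
        raise ValueError("need at least one shard copy")
    return copies // 2 + 1


for n in (1, 2, 3, 5):
    print(f"{n} copies -> quorum of {quorum(n)}")
```

Note that going from three to four copies raises the quorum from two to three, so an even number of copies buys no extra tolerance for unavailable copies.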

Changes become visible to searches when an index is refreshed, which by default happens once per second, on a shard-by-shard basis.

Optimistic concurrency control is done by specifying the version of the submitted documents.
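The idea behind version-based optimistic concurrency control can be sketched with a toy in-memory store. VersionedStore, VersionConflict and the expected_version parameter are hypothetical names for this illustration and not part of any Elasticsearch client; the point is only that a write carrying a stale version is rejected instead of silently overwriting newer data.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy in-memory store; every successful write bumps the version."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, body, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, body)
        return new_version


store = VersionedStore()
store.index("order-1", {"status": "new"})
version, body = store.get("order-1")  # both writers read this version

# The first writer succeeds and bumps the version.
store.index("order-1", {"status": "paid"}, expected_version=version)

# The second writer still holds the old version, so its write is rejected.
try:
    store.index("order-1", {"status": "cancelled"}, expected_version=version)
except VersionConflict as exc:
    print("conflict:", exc)
```

The rejected writer is expected to re-read the document, reapply its change and try again. No locks are held in the meantime, which is what makes the scheme optimistic.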

Elasticsearch is built for speed. Doing distributed transactions is a lot of work. Not providing them makes a lot of things easier. By accepting that what we read can be somewhat stale, and that everyone sees the same timeline, Elasticsearch can serve a lot of things from caches - which is paramount for the mind-boggling performance we love it for.

Schema Flexible

Elasticsearch does not require you to specify a schema upfront. Throw a JSON-document at it, and it will do some educated guessing to infer its type. It does a good job at things like numerics, booleans and timestamps. For strings, it will use the "standard"-analyzer, which is usually good to get started.
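The kind of educated guessing described above can be sketched as follows. This is a simplified illustration and not Elasticsearch's actual inference logic: guess_type is a hypothetical helper, the type names are chosen for illustration, and we assume that a "timestamp" means a string Python can parse as an ISO 8601 date-time.

```python
import json
from datetime import datetime


def guess_type(value):
    """Return a rough field type for a single JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):
        return "object"
    return "unknown"


doc = json.loads(
    '{"name": "Ada", "age": 36, "score": 9.5, '
    '"active": true, "joined": "2014-02-01T12:00:00"}'
)
mapping = {field: guess_type(value) for field, value in doc.items()}
print(mapping)
```

Guessing like this works until the first document is unrepresentative of the rest, which is one reason why tweaking the schema by hand pays off.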

While it's arguably "schema free", in the sense that you don't have to specify a schema, we like to think of it as "schema flexible" instead. To develop great search and/or analytics, you really need to tweak your schemas. Elasticsearch has an extensive set of powerful tools to help you, like dynamic templates, multi-field objects, etc. This is covered in more detail in our article on mapping.

Relations and Constraints

Elasticsearch is a document-oriented database. The entire object graph you want to search needs to be indexed, so your documents must be denormalized before they are indexed. Denormalization increases retrieval performance (since no query-time joining is necessary), but it uses more space (because things must be stored several times) and makes keeping things consistent and up-to-date more difficult (as any change must be applied to all instances). Denormalized documents are excellent for write-once-read-many workloads, however.

For example, say you have set up a database containing customers, orders and products, and you want to search for orders given the name of a product and a user. This could be solved by indexing orders with all the necessary information about the user and the products. Searching is then easy, but what happens when you want to change the name of the product? In a relational design with proper normalization, you would simply update the product and be done. That's what relational databases are really good at. With a denormalized document database, every order with the product would have to be updated.
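The trade-off can be sketched with plain Python dictionaries standing in for documents. All names and data here (users, products, build_order_doc, search_orders, rename_product) are made up for the illustration, and nothing in it talks to Elasticsearch.

```python
# Normalized source data, as it might live in a relational database.
users = {"u1": {"name": "Ada"}}
products = {"p1": {"name": "Espresso Machine"}, "p2": {"name": "Grinder"}}


def build_order_doc(order_id, user_id, product_ids):
    """Denormalize: copy user and product names into the order document."""
    return {
        "order_id": order_id,
        "user": {"id": user_id, "name": users[user_id]["name"]},
        "products": [
            {"id": pid, "name": products[pid]["name"]} for pid in product_ids
        ],
    }


orders = [
    build_order_doc(1, "u1", ["p1"]),
    build_order_doc(2, "u1", ["p1", "p2"]),
    build_order_doc(3, "u1", ["p2"]),
]


def search_orders(product_name, user_name):
    """Searching needs no join: everything is inside each order document."""
    return [
        order["order_id"]
        for order in orders
        if order["user"]["name"] == user_name
        and any(p["name"] == product_name for p in order["products"])
    ]


def rename_product(product_id, new_name):
    """Renaming touches the product record and every order embedding it."""
    products[product_id]["name"] = new_name
    updated_orders = 0
    for order in orders:
        embedded = [p for p in order["products"] if p["id"] == product_id]
        for p in embedded:
            p["name"] = new_name
        if embedded:
            updated_orders += 1
    return updated_orders


print(search_orders("Espresso Machine", "Ada"))
updated = rename_product("p1", "Coffee Maker")
print(f"orders rewritten: {updated}")
print(search_orders("Coffee Maker", "Ada"))
```

With three orders the rewrite is trivial. With millions of orders referring to a popular product, a single rename turns into a bulk reindexing job.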

In other words, with document oriented databases like Elasticsearch, we design our mappings and store our documents such that it's optimized for search and retrieval.

As mentioned in the introduction, Elasticsearch has a concept of "query time" joining with parent/child-relations, and "index time" joining with nested types. We'll probably cover this in more depth in a future article. In the meantime, we can recommend Martijn van Groningen's presentation "Document relations with Elasticsearch".

Most relational databases also let you specify constraints to define what is and isn't consistent. For example, referential integrity and uniqueness can be enforced. You can require that the sum of account movements must be positive and so on. Document oriented databases tend not to do this, and Elasticsearch is no different.

Robustness

A database should be robust, especially if it is your authoritative system of record. Ideally, it should be possible to cancel a costly query, and you certainly don't want the database to stop working unless you tell it to.

Unfortunately, Elasticsearch (and the components it's made of) does not currently handle OutOfMemory-errors very well. We cover this in more depth in Elasticsearch in Production, OutOfMemory-Caused Crashes. It is very important to provide Elasticsearch with enough memory and be careful before running searches with unknown memory requirements on a production cluster.

While this is likely to improve as Elasticsearch matures, it's important to remember that Elasticsearch is built for speed, with the assumption that memory is abundant.

Distributed

See also: Elasticsearch in Production, Networking.

Before Shay Banon created Elasticsearch, he had been working on Compass. Realizing it would be hard to turn it into a distributed search engine, he started from scratch and created Elasticsearch [1]. Elasticsearch is designed to be distributed and easy to scale out to handle massive amounts of data on commodity hardware.

Elasticsearch is incredibly easy to use and get started with for a distributed system, but distributed systems are complicated. We cover this a bit more in Elasticsearch in Production, Networking, so what follows is a short summary.

The very nature of a distributed system implies a myriad of things that can go wrong. As such, different database systems focus on different strengths: some strive for strong guarantees, others for always being available, despite being erroneous some (or even most) of the time. Furthermore, what a database system claims to achieve when problems occur is rarely what it actually copes with, as Kyle Kingsbury explores in his excellent series on the perils of network partitions. In short, he finds that while distributed databases work fine on a sunny day, most struggle when subjected to the vast number of ways in which things can fail.

In terms of consistency, availability and partition tolerance, Elasticsearch is a CP-system, for a fairly weak definition of "consistent". If you have a read-only workload, Elasticsearch lets you achieve AP-behaviour by having a relaxed "minimum master nodes"-requirement, i.e. not requiring a quorum. Generally, however, you will need the majority of nodes in the cluster to be available. Writing to a misconfigured cluster without this majority, i.e. a cluster with a "split brain", can result in irrecoverable data loss. This is by no means specific to Elasticsearch.
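Why the majority requirement prevents a split brain can be shown with a little arithmetic. The sketch below is a deliberately simplified model of master election, not Elasticsearch's implementation: a partition may elect a master if it contains at least "minimum master nodes" nodes. The function names are our own.

```python
def majority(nodes: int) -> int:
    """Smallest majority of a cluster with `nodes` master-eligible nodes."""
    return nodes // 2 + 1


def partitions_with_master(partition_sizes, minimum_master_nodes):
    """Sizes of the partitions that are still allowed to elect a master."""
    return [size for size in partition_sizes if size >= minimum_master_nodes]


cluster_size = 5
split = [3, 2]  # a network partition cuts the cluster in two

# Requiring a majority: at most one side can ever elect a master.
strict = partitions_with_master(split, majority(cluster_size))

# Relaxed requirement: both sides elect a master, i.e. a split brain.
relaxed = partitions_with_master(split, 1)

print("majority required:", strict)
print("relaxed:", relaxed)
```

Two disjoint partitions cannot both hold a majority, so the strict setting trades availability on the minority side for having a single master. A three-way split such as [2, 2, 1] leaves no side able to elect one at all.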

In terms of scaling, an index is divided into one or more shards. The number of shards is specified when the index is created and cannot be changed afterwards. Thus, an index should be sharded in proportion to its anticipated growth. As more nodes are added to an Elasticsearch cluster, it does a good job of reallocating and moving shards around. As such, Elasticsearch is very easy to scale out.
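One way to see why the shard count is fixed is to look at hash-based routing, where a document's shard is derived from a hash of its routing key modulo the number of shards. The sketch below assumes that scheme and uses CRC32 purely as a stand-in hash function. The name shard_for and the document IDs are hypothetical.

```python
import zlib


def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Pick a shard from a stable hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Count the documents that would land on a different shard if the shard
# count were changed from 5 to 6 after they had been indexed.
moved = sum(1 for d in doc_ids if shard_for(d, 5) != shard_for(d, 6))
print(f"{moved} of {len(doc_ids)} documents would change shard")
```

Because most documents would map to a different shard under the new count, lookups by ID would go to the wrong place unless everything were reindexed. Moving whole shards between nodes, on the other hand, leaves the mapping untouched.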

Security

See also: Elasticsearch in Production, Security.

Elasticsearch does not have any features for authentication or authorization. You should consider anyone who can connect to your Elasticsearch cluster to have "super user" rights, especially if Elasticsearch's powerful scripting capabilities are enabled.

Summary

It is certainly possible to use Elasticsearch as a primary store, when the limitations described are not showstoppers. One good example is when using Logstash. Logstash is a fantastic tool for managing logs and shoving them into Elasticsearch, perhaps also archiving them somewhere else just in case. Logs are write once, read many. No updating, no need for transactions, integrity constraints, etc.

What about systems like Postgres, that come with full-text search and ACID-transactions? (Other examples are the full-text capabilities of MySQL, MongoDB, Riak, etc.) While you can implement basic search with Postgres, there's a huge gap both in possible performance, and in the features. As mentioned in the section on transactions, Elasticsearch can "cheat" and do a lot of caching, with no concern for multi version concurrency control and other complicating things. Search is also more than finding a keyword in a piece of text: it's about applying domain specific knowledge to implement good relevancy models, giving an overview of the entire result space, and doing things like spell checking and autocompletion. All while being fast.

Elasticsearch is commonly used in addition to another database. A database system with stronger focus on constraints, correctness and robustness, and on being readily and transactionally updatable, has the master record - which is then asynchronously pushed to Elasticsearch. (Or pulled, if you use one of Elasticsearch's "rivers".) Keeping things in sync is something we'll cover in depth in a future article. Here at Found, we typically use PostgreSQL and ZooKeeper as keeper of truths, which we feed into Elasticsearch for awesome searching.

Like with everything else, there's no silver bullet, no one database to rule them all. That's likely to always be the case, so know the strengths and weaknesses of your stores!

References

  1. Shay Banon, The future of compass & elasticSearch, https://thedudeabides.com/articles/the_future_of_compass