Tech Topics

NoSQL, Yes Search

NoSQL seems to be all the hype, helping people get away from databases into more (horizontally) scalable solutions while sacrificing certain database characteristics (and gaining others, such as being schema free). And NoSQL products seems to start showing up everywhere such as Cassandra, HBase, riak, mongodb, Project Voldemort, and Terrastore to name a few.

To this list, I would also add long time (data) grid providers such as GigaSpaces, Oracle Coherence, and IBM ObjectGrid. Products that are playing at a much higher playground than the previous ones (think NoSQL++).

As a side note, its great that the NoSQL solution is gaining steam. If in the past (3-4 years ago) I had to explain and educate why such concepts as collocation, master worker, map reduce, and in memory storage are very important when designing scalable systems, now they have become almost common knowledge.

All products mentioned provide, in one form or another, the ability to query data. Some use SQL like queries, others use map reduce. At one point though, regardless of where or how you store your data, you would want to add search to it. What do I mean when I say search? I mean search in what you expect search engines to provide you with. Think of twitter solving their data storage problems, but not being able to search on all the tweets (preferably in real time).

But now we find ourself in a problematic situation. We finally managed to solve the scaling requirements of our data storage, only to find that we need to add search to our system (hopefully, you are adding search from the get go, as its a must have feature in any application). And the search solution needs to scale. More over, since the dream scenario is that our search solution would allow us to search on all our data, it needs to scale as much as our NoSQL solution does.

For this reason, a true, distributed, scalable search solution is required. And guess what, I think I know of one :). So how to we pull this off, the integration between elasticsearch and our chosen NoSQL solution?

Well, we can start with the simplest solution. Anytime we update the NoSQL solution, we can go ahead and execute the same or a similar update to elasticsearch. Since elasticsearch has a REST API, it can basically integrate with any language that interacts with our chosen NoSQL. This is the simplest way to solve our problem.

A more interesting solution would be to integrate search into NoSQL. Think of an indexing stored procedure running within our NoSQL cluster and every time something changes, the same change is applied to elasticsearch as well. Moreover, converting from either pure JSON, column based storage, or even Object based is just a matter of converting them into an indexable JSON that elasticsearch provides. And, thanks to the fact that elasticsearch data model is flexible (as explained in the Your Data, Your Search blog post), the conversion is extremely simple.

As an example, with GigaSpaces, we can register an event container (against a customizable matching query/template) that will automatically apply any changes done in the data grid into elasticsearch, the code would generally look something like this (including a sneak peak at the upcoming elasticsearch Java API):

void onEvent(Object event, EntryArrivedRemoteEvent event) {
String index = "myIndex";
String type = event.getClass().getSimpleName();
String id = extractId(event);
switch (remoteEvent.getNotifyActionType()) {
case NOTIFY_LEASE_EXPIRATION:
case NOTIFY_TAKE:
esClient.delete(deleteRequest("myIndex")
.type(type)
.id(id));
break;
case NOTIFY_UPDATE:
case NOTIFY_WRITE:
esClient.index(indexRequest("myIndex")
.type(type)
.id(id)
.source(toJson(event)));
break;
}
}

When doing so, we can actually get all the features GigaSpaces provides as a high throughput, low latency, scalable data grid (among many other features), with the option to perform full text search on all/part of the data stored.

Another example, which is already there to play with (proper documentation coming soon) is the integration done between Terrastore and elasticsearch. Since terrastore handles data in JSON format, there is no conversion needed and applying changes in terrastore is just a matter of registering an EventListener (simplified version of this code):

void onValueChanged(final String bucket, final String key, final byte[] value) {
esClient.index(indexRequest("myIndex")
.type(bucket)
.id(key)
.source(value));
}

void onValueRemoved(final String bucket, final String key) {
esClient.delete(deleteRequest("myIndex")
.type(bucket)
.id(key));
}

At the end, your system of record can be a database, a NoSQL product, or something else entirely. With search becoming a requirement in any application built this days (real time search is all the buzz ;) ), a proper thought into how to solve this problem will help you in the long run. And remember, if your data needs scaling, you search solution has to be able to scale as well.

-shay.banon

If someone is interested (I sure as hell am) at getting the same integration into other NoSQL solutions, such as Cassandra, HBase, riak, mongodb, and Project Voldemort, drop a line at elasticsearch and lets go for it.