Categorizing images with deep learning into Elasticsearch

Deepdetect is a young open source deep-learning server and API designed to help in bridging the gap toward machine learning as a commodity. It originates from a series of applications built for a handful of large corporations and small startups. It has support for Caffe, one of the most appreciated libraries for deep learning, and it easily connects to a range of sources and sinks. This enables deep learning to fit into existing stacks and applications with reduced effort.

Machine learning is the next expected commodity on the developer's stack. Many software packages, most of them open source, are slowly but surely empowering the developers with these new technologies for automation. As the developer of Deepdetect, an open source deep-learning server, I've been building a range of smart applications for a variety of large corporations and small startups.

In most of these production settings, the existing stack relies on a series of data backends, among which Elasticsearch is prominent. For this reason Deepdetect builds a direct connection to all backends through a little trick: the server supports output templates so that data can be molded to fit any sink backend.

We recently applied this trick to build an image search engine with Elasticsearch in just a few steps. You send images to Deepdetect, the images get tagged (a hedgehog, a plane, etc.), then the tags and the image URL get indexed into Elasticsearch directly, without any glue code. That's it. You can now search images with text, even when no caption is available. Beyond being cool, this is also scalable, since prediction works over batches of images and multiple prediction servers can be set to work in parallel. Below, we show how to reproduce this very simple setup for your own applications.

Machine Learning and Deep Learning

But first, if you are not familiar with the topic, machine learning has become a ubiquitous technology that is powering a growing number of high automation software offerings, sometimes referred to as smart or intelligent applications. These range from spam filters and document sorting to sentiment analysis, search ranking, image, audio and video recognition (and even dreams!).

In a nutshell, machine learning automates classification tasks in two steps: the training step builds a model out of data for a targeted task, e.g. image classification, while the prediction step leverages that model to predict, for instance, the category of new images. There are some cool demonstrations out there; do you believe we can accurately predict exactly who dies on the Titanic?
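To make the two steps concrete, here is a minimal sketch using a toy nearest-centroid classifier in plain Python. It is not what Deepdetect does internally; the function names and the two-feature data are made up purely for illustration.

```python
# Toy illustration of the train/predict split (hypothetical names and data).
from math import dist


def train(samples):
    """Training step: build a model (one mean point per label) from labeled data."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Prediction step: return the label whose mean point is closest."""
    return min(model, key=lambda label: dist(model[label], features))


training_data = [
    ([1.0, 1.2], "hedgehog"),
    ([0.8, 1.0], "hedgehog"),
    ([5.0, 4.8], "plane"),
    ([5.2, 5.1], "plane"),
]
model = train(training_data)
print(predict(model, [0.9, 1.1]))  # hedgehog
print(predict(model, [4.9, 5.0]))  # plane
```

A deep learning model replaces the mean points with millions of learned parameters, but the split between a costly training step and a cheap prediction step is the same.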

Machine learning dates back to the 1960s, though it is the last decade's surge in data, cheap computational resources, and crucial scientific achievements that truly unlocked its potential in production.(1) The most advanced applications these days leverage the field of deep learning, a set of models in the form of complex neural networks with the ability to capture high-level abstractions in data. These models are computed by dedicated algorithms that for the most part apply a small set of linear and nonlinear operators repeatedly over slices of the training data. Because these operations need to be repeated millions of times over matrices that can reach up to billions of entries, they are best parallelized on special processors, originally dedicated to graphical computations such as video games, and known as Graphics Processing Units (GPUs).

Deep learning continues to reveal spectacular properties, such as the ability to recognize images or classify text without much engineering. Raw images or text are fed to the algorithm along with the desired output, and the resulting model can be used to predict the output on more data. This prediction has very high accuracy, sometimes above that of humans! This is a big step forward compared to previous machine learning systems in which engineers had to half-blindly help the algorithms with a set of hand-made tricks and transforms to the data.

Machine Learning as a Commodity

For these reasons, it is believed that machine learning stands as the next commodity on the developer's stack. This compares to the way Elasticsearch is a storage and search engine commodity for many of us today. We're not exactly there yet with machine learning, and this crucial step requires a lot of work.

Deepdetect is only one young deep learning commodity. Much more mature frameworks exist, such as Apache Mahout, Apache Spark, scikit-learn, and deeplearning4j, to name a few. In all honesty, it is becoming difficult to make a choice. Not all of them provide support for state-of-the-art deep learning, and their respective APIs vary from low-level algorithm parametrization to higher, application-level interfaces. Check them all out, as more than one may fit your needs.

Among the data sinks, Elasticsearch is a primary choice among developers. Moreover, most of the machine learning tasks require a fast and scalable data back-end to feed the algorithms, a trend even aggravated by the greedy deep learning algorithms.

Building a search engine of images

So let us use Deepdetect to set up a classification service that distinguishes among 1000 different image categories, from 'ambulance' to 'padlock' to 'hedgehog', and indexes images with their categories into an instance of Elasticsearch. For every image, the Deepdetect server can directly post and index the predicted categories into Elasticsearch. This means there's no need for glue code between the deep learning server and Elasticsearch. This paper has more background on the true cost of building machine learning applications.

First you need to build & install Deepdetect and set up an image classifier. This should take just a few minutes. Now power up an instance of Elasticsearch. In the following, we assume that it is listening on localhost:9200.

Categorizing images into Elasticsearch

Deepdetect supports output templates. An output template allows transforming the standard output of the server into any custom format. Here we use this capability to directly index the Deepdetect output into Elasticsearch.

Here is our first image:

Scene from Interstellar. Copyright © 2014 by Paramount Pictures. All Rights Reserved.

Let's predict the categories for it and index them along with the image URL into Elasticsearch:

curl -XPOST "http://localhost:8080/predict" -d '{"service":"imageserv","parameters":{"mllib":{"gpu":true},"input":{"width":224,"height":224},"output":{"best":3,"template":"{ {{#body}}{{#predictions}} \"uri\":\"{{uri}}\",\"categories\": [ {{#classes}} { \"category\":\"{{cat}}\",\"score\":{{prob}} } {{^last}},{{/last}}{{/classes}} ] {{/predictions}}{{/body}} }","network":{"url":"http://localhost:9200/images/img","http_method":"POST"}}},"data":[""]}'

and equivalently using the Deepdetect Python client:

from dd_client import DD

# Connect to the Deepdetect server running on localhost
dd = DD('localhost')
mllib = 'caffe'
data = ['']
# Same input size as in the curl call above
parameters_input = {'width':224,'height':224}
parameters_mllib = {'gpu':True}
# Mustache template and Elasticsearch endpoint, identical to the curl call
parameters_output = {"best":3,"template":"{ {{#body}}{{#predictions}} \"uri\":\"{{uri}}\",\"categories\": [ {{#classes}} { \"category\":\"{{cat}}\",\"score\":{{prob}} } {{^last}},{{/last}}{{/classes}} ] {{/predictions}}{{/body}} }","network":{"url":"http://localhost:9200/images/img","http_method":"POST"}}
predict_output = dd.post_predict('imageserv',data,parameters_input,parameters_mllib,parameters_output)

which yields:


which is the output of Elasticsearch, as reported by the Deepdetect server.

Let's check that our image is within the index:

curl -XGET "http://localhost:9200/images/_search?q=helmet"
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.09492774,
    "hits" : [ {
      "_index" : "images",
      "_type" : "img",
      "_id" : "AVCvc16VzqwAL3DK-gQ8",
      "_score" : 0.09492774,
      "_source": {"doc": { "uri":"","categories": [ { "category":"n03868863 oxygen mask","score":0.225514 } , { "category":"n03127747 crash helmet","score":0.209176 } , { "category":"n03379051 football helmet","score":0.0739932 } ] } }
    } ]
  }
}

The best prediction the image tagging model comes up with is 'oxygen mask' and the second one is 'crash helmet'. We could expect 'astronaut' instead as a better answer, and the reason why it is missing here is that 'astronaut' is not among the classes the model has been trained to recognize. The full set of classes for this particular model is available here. This means that for every targeted application, you usually need to build a dedicated model.

Also, note the two main parameters in the prediction and indexing calls to the deep learning server:

  • template within the parameters/output object: this takes a template in Mustache format. The available variables are those from the original Deepdetect output. Note the {{^last}},{{/last}} element passed to the template: it prevents a comma from appearing after the last of the 'classes' elements. This is a common trick with Mustache templates;
  • network defines where and how the output should be sent. Here, the url holds the Elasticsearch resource and transport information, and http_method is set to 'POST'. Another parameter is content_type, which defaults to 'Content-Type: application/json'.

These two simple parameters can accommodate a variety of connections to external software applications, far beyond Elasticsearch.
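To see what the template produces, the sketch below builds the same JSON document in plain Python. The shape of the `prediction` dictionary (`uri`, `classes`, `cat`, `prob`) is assumed from the variables used in the template, the example URI is a placeholder, and `render_document` is a hypothetical helper, not part of Deepdetect. The `join` call plays the role of the {{^last}},{{/last}} trick: a comma between elements, none after the last.

```python
import json


def render_document(prediction):
    """Mimic the Mustache template: one JSON object with the URI and categories."""
    classes = ",".join(
        '{"category":%s,"score":%s}' % (json.dumps(c["cat"]), json.dumps(c["prob"]))
        for c in prediction["classes"]
    )
    return '{"uri":%s,"categories":[%s]}' % (json.dumps(prediction["uri"]), classes)


# Assumed input shape; categories and scores taken from the search result above.
prediction = {
    "uri": "http://example.com/image.jpg",  # placeholder URI
    "classes": [
        {"cat": "n03868863 oxygen mask", "prob": 0.225514},
        {"cat": "n03127747 crash helmet", "prob": 0.209176},
    ],
}
document = render_document(prediction)
json.loads(document)  # raises ValueError if a stray comma slipped in
print(document)
```

Parsing the rendered string back with `json.loads` is a cheap way to check a template before pointing it at a real index.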

Bulk categorization

Let's improve on the categorization call above to categorize and index multiple images at once. We add the following image:


For categorizing and indexing the two images at once, this time we use the Elasticsearch Bulk API:

curl -XPOST "http://localhost:8080/predict" -d '{"service":"imageserv","parameters":{"mllib":{"gpu":true},"input":{"width":224,"height":224},"output":{"best":3,"template":"{{#body}} {{#predictions}} { \"index\": {\"_index\": \"images\", \"_type\":\"img\" } }\n {\"doc\": { \"uri\":\"{{uri}}\",\"categories\": [ {{#classes}} { \"category\":\"{{cat}}\",\"score\":{{prob}} } {{^last}},{{/last}}{{/classes}} ] } }\n {{/predictions}} {{/body}} }","network":{"url":"http://localhost:9200/images/_bulk","http_method":"POST"}}},"data":["",""]}'
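For reference, the Bulk API expects newline-delimited JSON: an action line followed by a document line for each item, with every line ending in a newline. As a rough sketch of what the template above renders, the following builds such a body in Python. The input shape and the `bulk_body` helper are assumptions made for illustration, and the URIs are placeholders; the index and type names are those used in the call above.

```python
import json


def bulk_body(predictions, index="images", doc_type="img"):
    """Build a newline-delimited Bulk API body: action line, then document line."""
    lines = []
    for p in predictions:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        categories = [{"category": c["cat"], "score": c["prob"]} for c in p["classes"]]
        lines.append(json.dumps({"doc": {"uri": p["uri"], "categories": categories}}))
    # Every line, including the last one, must be terminated by a newline.
    return "".join(line + "\n" for line in lines)


# Assumed input shape with placeholder URIs.
predictions = [
    {"uri": "http://example.com/a.jpg",
     "classes": [{"cat": "n03868863 oxygen mask", "prob": 0.225514}]},
    {"uri": "http://example.com/b.jpg",
     "classes": [{"cat": "n02346627 porcupine, hedgehog", "prob": 0.783433}]},
]
body = bulk_body(predictions)
print(body, end="")
```

Two images yield four lines, which is why the template carries explicit \n separators between the action and document objects.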

Simple and fast, the two images are now indexed! Let's check on the second one by looking a hedgehog up in our index:

curl -XGET "http://localhost:9200/images/_search?q=hedgehog"
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.057534903,
    "hits" : [ {
      "_index" : "images",
      "_type" : "img",
      "_id" : "AVCvc16VzqwAL3DK-gQ9",
      "_score" : 0.057534903,
      "_source": {"doc": { "uri":"","categories": [ { "category":"n02346627 porcupine, hedgehog","score":0.783433 } , { "category":"n02138441 meerkat, mierkat","score":0.0204417 } , { "category":"n02442845 mink","score":0.0182722 } ] } }
    } ]
  }
}

We have a hedgehog indexed, which means our prediction appears to have done a good job.

So images can now easily be retrieved by keywords from Elasticsearch. Try it out on your own collections and applications.
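The same lookup can be done from Python. Here is a minimal sketch using only the standard library, assuming Elasticsearch listens on localhost:9200 as above. `search_url`, `best_categories` and `search` are hypothetical helpers, and the sample response is trimmed from the output shown above, with a placeholder URI.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def search_url(keyword, base="http://localhost:9200", index="images"):
    """Build the URI search request used in the curl calls."""
    return "%s/%s/_search?q=%s" % (base, index, quote(keyword))


def best_categories(response):
    """Return (uri, top category, score) for each hit of a search response."""
    results = []
    for hit in response["hits"]["hits"]:
        doc = hit["_source"]["doc"]
        top = max(doc["categories"], key=lambda c: c["score"])
        results.append((doc["uri"], top["category"], top["score"]))
    return results


def search(keyword):
    """Query a running Elasticsearch instance (not called in this sketch)."""
    with urlopen(search_url(keyword)) as reply:
        return best_categories(json.load(reply))


# Offline check against a response trimmed from the output above.
sample = {"hits": {"hits": [{"_source": {"doc": {
    "uri": "http://example.com/hedgehog.jpg",  # placeholder URI
    "categories": [
        {"category": "n02346627 porcupine, hedgehog", "score": 0.783433},
        {"category": "n02138441 meerkat, mierkat", "score": 0.0204417},
    ]}}}]}}
print(search_url("hedgehog"))
print(best_categories(sample))
```

With a live index, `search("hedgehog")` would return one tuple per matching image.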


  • The deep neural net model used above has been trained on 1000 generic categories. In practice it is not accurate enough for specialized tasks. Other models are available, from clothing to age prediction, or you can train your own models for your own applications. Training deep models is computationally intensive: the model above requires between two weeks and a full month of computation on a professional NVIDIA GPU such as the K40. Smaller models, from binary classifiers (e.g. detection: is it there or not) to a hundred classes, can be trained in a few hours on a mid-range gaming machine (i.e. one with a consumer-grade GPU).
  • The prediction step, however, is much lighter. You can fill up your RAM or GPU memory with as many images as possible to categorize and index larger batches at once. This should make it possible to process collections of up to several million images in just a few hours. Single image prediction runs in about 0.007 seconds on a GPU (i.e. about 2 hours for a million images), and 1.5 seconds on a MacBook Air's Intel Core i7 3.20GHz CPU.
  • While this short tutorial focuses on images, you can in practice rely on deep neural nets for other tasks without many changes in the above setup and calls. Typical tasks may include text classification, sentiment analysis, data classification, prediction & categorization, image segmentation, etc.
  • There are ways to build super-fast and very accurate image-to-image similarity search based on similar deep neural nets.
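The timing figures in the notes above translate into the following back-of-the-envelope estimate. The per-image timings are the ones quoted above; the batch size of 64, the placeholder URIs and the `batches` helper are arbitrary illustrations, and real throughput depends on hardware and batch size.

```python
GPU_SECONDS_PER_IMAGE = 0.007  # figure quoted above
CPU_SECONDS_PER_IMAGE = 1.5    # figure quoted above


def hours_for(n_images, seconds_per_image):
    """Total sequential prediction time, in hours."""
    return n_images * seconds_per_image / 3600.0


def batches(uris, size):
    """Split a list of image URIs into batches, one predict call per batch."""
    return [uris[i:i + size] for i in range(0, len(uris), size)]


million = 1_000_000
print("GPU: %.1f hours" % hours_for(million, GPU_SECONDS_PER_IMAGE))        # GPU: 1.9 hours
print("CPU: %.1f days" % (hours_for(million, CPU_SECONDS_PER_IMAGE) / 24))  # CPU: 17.4 days

uris = ["http://example.com/%d.jpg" % i for i in range(150)]  # placeholder URIs
print([len(b) for b in batches(uris, 64)])  # [64, 64, 22]
```

Each batch would go into the data array of a single predict call, and several servers can each take their own share of the batches.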

So we've built a short pipeline for using deep learning over images (e.g. for image categorization) and indexing them into Elasticsearch. This can easily be scaled to millions of images as needed.

With Deepdetect we are touching upon a variety of application fields, from cybersecurity to image and text classification. In cybersecurity especially, the Elastic Stack is among the common tools, and slight modifications to the settings above ensure we can use deep learning as a commodity to better qualify the data points before pushing them into Elasticsearch and Kibana.

(1) A bank check recognition system was already reading 10% of all US checks in the late 1990s.

About Emmanuel Benazera

Emmanuel holds a PhD in Computer Science & AI from University Toulouse-III in France. He worked as a researcher in CS, AI and robotics at a variety of labs, from NASA Ames Research Center to DFKI, CNRS and Inria, and founded or co-founded a few startups around intelligent software solutions. His main applied research interests lie in automated decision systems, machine learning, information retrieval, p2p networks and open source software.