Getting Started with LIRE and Elasticsearch
UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.
Search for Images Using Images
LIRE (Lucene Image REtrieval) is a plugin for Lucene to index and search images. A cool and quirky feature that sets it apart is that it does content-based retrieval, a fancy way of saying that you use images in your search query and it retrieves similar images from the index. In order to use LIRE with Elasticsearch, we need to make Elasticsearch aware of the new data type and the query type that LIRE provides. Luckily there is a plugin for Elasticsearch that does just that.
Installing
The LIRE plugin for Elasticsearch is called elasticsearch-image and is available on GitHub. It bundles LIRE, so all we have to do is install a version that matches our Elasticsearch version. Unfortunately, it currently lags a little behind the Elasticsearch releases, but as Elasticsearch is quick to adopt new Lucene versions, this is probably to be expected for plugins that tie in all the way down to the Lucene layer.
The latest release of the plugin is 1.2.0 and it was built against Elasticsearch 1.0.1. In this article we will be using Elasticsearch 1.0.3, because the only changes between those versions are bug fixes. The master branch is currently targeting Elasticsearch 1.1.0, but it is still a work in progress as not all tests are passing.
On a Found Cluster
To install the plugin on a Found cluster you need to upload it as a custom plugin. The easiest way to get hold of the plugin is through Maven Central. Search for it, download the zip file corresponding to the desired version, and upload it through our console as described in our documentation.
On a Self-Hosted Cluster
If you host the cluster yourself you can simply use the plugin tool like this:
bin/plugin -install com.github.kzwang/elasticsearch-image/1.2.0
Mappings
In order to get started, we need to create an index with appropriate mappings. For this test, the cluster only has one node and we don’t intend to grow it any further, so we only use one shard in the index. More interesting are the mappings:
"image": { "properties": { "name": { "type": "string" }, "image": { "type": "image", "feature": { "CEDD": { "hash": "BIT_SAMPLING" }, "JCD": { "hash": ["BIT_SAMPLING", "LSH"] }, "FCTH": {} } } } }
We define a type called image with two properties: name, the file name of the image, and image, which will contain the actual image data. The first is a plain string type, but the latter has type image. The image data type is defined by the plugin and is how Elasticsearch knows to utilize the LIRE plugin. The data type supports different types of features, and these in turn provide different ways of matching the query image with the indexed images when searching.
Features Supported
Currently the Elasticsearch plugin supports the following features:
- AUTO_COLOR_CORRELOGRAM
- BINARY_PATTERNS_PYRAMID
- CEDD
- SIMPLE_COLOR_HISTOGRAM
- COLOR_LAYOUT
- EDGE_HISTOGRAM
- FCTH
- GABOR
- JCD
- JOINT_HISTOGRAM
- JPEG_COEFFICIENT_HISTOGRAM
- LOCAL_BINARY_PATTERNS
- LUMINANCE_LAYOUT
- OPPONENT_HISTOGRAM
- PHOG
- ROTATION_INVARIANT_LOCAL_BINARY_PATTERNS
- SCALABLE_COLOR
- TAMURA
Detailing each of these techniques could easily become a comprehensive book on its own, so I will simply have to refer to the LIRE documentation.
Finally, we create the index like this:
curl -XPOST 'https://cluster_id-eu-west-1.foundcluster.com:9243/images' -d '{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "image": {
      "properties": {
        "name": {
          "type": "string"
        },
        "image": {
          "type": "image",
          "feature": {
            "CEDD": { "hash": "BIT_SAMPLING" },
            "JCD": { "hash": ["BIT_SAMPLING", "LSH"] },
            "FCTH": {}
          }
        }
      }
    }
  }
}'
Indexing Images
Indexing a document containing image data is not much different from indexing any other document. The API is the same; the only difference is that we want to put binary data in the image field. Since JSON does not support binary data as a native value, the plugin expects the data to be Base64-encoded as a JSON string. To distinguish these strings from normal strings, we also have to define the mappings explicitly for the image field.
Base64 encoding is a transformation that encodes three bytes of data into four printable characters. Most programming languages have utilities to do this, and if not, it is usually not difficult to adapt an example found online.
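To make the three-bytes-to-four-characters ratio concrete, here is a minimal sketch. It assumes Java 8 or later, where java.util.Base64 is available. That is a different utility from the sun.misc.BASE64Encoder used in the rest of this article, and the object name Base64Sketch is made up for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object Base64Sketch {
  // Encode raw bytes as a single-line Base64 string (the basic encoder adds no line breaks).
  def encode(data: Array[Byte]): String =
    Base64.getEncoder.encodeToString(data)

  // Decode a Base64 string back to the original bytes.
  def decode(text: String): Array[Byte] =
    Base64.getDecoder.decode(text)

  def main(args: Array[String]): Unit = {
    val threeBytes = "Man".getBytes(StandardCharsets.US_ASCII)
    println(encode(threeBytes)) // three input bytes become four characters: TWFu

    val twoBytes = "Ma".getBytes(StandardCharsets.US_ASCII)
    println(encode(twoBytes))   // shorter input is padded with '=': TWE=
  }
}
```

Because this encoder does not wrap its output in lines, there are no newlines to strip afterwards.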
For this article, I have written a small example in Scala that indexes the 25k images data set from Flickr. Using the sun.misc.BASE64Encoder, it is pretty straightforward to create JSON documents for each image file. The only gotcha is removing the newlines, like this:
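import java.nio.file.Files
import java.nio.file.Path
import sun.misc.BASE64Encoder

case class ImageDocument(name: String, data: Array[Byte]) {
  // Build a document from a file: the file name becomes the name, the file's bytes the data.
  def this(path: Path) = this(path.toFile().getName(), Files.readAllBytes(path))

  // BASE64Encoder wraps its output in lines, so strip the line breaks before embedding it in JSON.
  def toJson = {
    s"""
    {
      "name" : "$name",
      "image" :"${new BASE64Encoder().encode(data).replaceAll("[\n\r]", "")}"
    }"""
  }
}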
To retrieve all the images in a certain path, we define the following function:
import scala.collection.JavaConverters._

// `path` is the directory holding the images, defined elsewhere in the example.
def getData(): Stream[ImageDocument] = {
  for {
    childPath <- Files.newDirectoryStream(path).asScala.toStream
    if childPath.toFile().getName().endsWith(".jpg")
  } yield new ImageDocument(childPath)
}
We then push the stream of images in batches to Elasticsearch. First we create a transport client to my Found cluster and wait for it to connect:
val settings = ImmutableSettings.settingsBuilder()
  .put("transport.type", "no.found.elasticsearch.transport.netty.FoundNettyTransportModule")
  .put("transport.found.api-key", "api-key")
  .put("cluster.name", "cluster_id")
  .put("client.transport.ignore_cluster_name", false)
  .build();

val client = {
  val address = new InetSocketTransportAddress("clusterid-region.foundcluster.com", 9343);
  ElasticClient.fromClient(new TransportClient(settings).addTransportAddress(address))
}

println("Waiting for clusterstatus yellow.")
val clusterStatus = client.client.admin().cluster().prepareHealth()
  .setWaitForYellowStatus().execute().actionGet().getStatus().name()
println(s"Got clusterstatus $clusterStatus")
With the client ready, the actual indexing is done like this:
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Send the images in groups of bulkSize, one bulk request per group.
val bulkLoads = images.grouped(bulkSize).map { batch =>
  client.bulk(
    batch.map(image => index into "images/image" source new StringSource(image.toJson)): _*)
}.toStream

// Close the client once every bulk request has completed.
Future.sequence(bulkLoads).onComplete(result => {
  println("Closing client")
  client.close()
})

// Report the outcome of every item in every bulk response.
for (bulk <- bulkLoads) {
  bulk.onSuccess {
    case bulkResult => {
      println(s"Bulkresult ${bulkResult.getItems()}")
      for (result <- bulkResult.getItems()) {
        if (result.isFailed()) {
          println(s"Failed response: [${result.getFailure()}][${result.getFailureMessage()}]")
        } else {
          println("Bulk success")
        }
      }
    }
  }
}

// Sum the failed items across all bulk responses.
val failureCount = (Future sequence bulkLoads.map(_.map(_.asScala.count(_.isFailed())))).map(_.sum)
failureCount.onSuccess {
  case count: Int => {
    println(s"Found ${images.size} images, $count failed to index")
  }
}
The complete example can be found in the gist.
Things to Consider While Indexing
Chances are, your images will be quite a bit larger than the documents you usually index. This implies that you will probably have to use considerably smaller batches while indexing. As a point of reference, I managed to index the entire 25k Flickr data set on a single-node Elasticsearch cluster with 2 GB memory (1 GB heap) and 70 images in each batch. It is also recommended to increase the refresh interval, so that the index refreshes less often, while doing heavy indexing.
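Since image sizes vary a lot, one alternative to a fixed number of images per batch is to cap each batch by its total payload size. The following is only a sketch of that idea, not what the example in this article does (it uses a fixed count via grouped(bulkSize)). The names batchBySize and maxChars are made up for illustration.

```scala
object BatchSketch {
  // Group JSON documents into batches whose combined length stays at or below maxChars.
  // A single document longer than maxChars still gets a batch of its own.
  def batchBySize(docs: Seq[String], maxChars: Int): Vector[Vector[String]] = {
    val start = (Vector.empty[Vector[String]], Vector.empty[String], 0)
    val (done, last, _) = docs.foldLeft(start) {
      case ((batches, current, size), doc) =>
        if (current.nonEmpty && size + doc.length > maxChars)
          (batches :+ current, Vector(doc), doc.length) // close the full batch, start a new one
        else
          (batches, current :+ doc, size + doc.length)  // the document still fits
    }
    if (last.isEmpty) done else done :+ last
  }
}
```

Each resulting batch could then be sent as one bulk request, in the same way as the fixed-size groups above.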
Searching
The elasticsearch-image plugin also provides a new query type, aptly named image. Using it with the REST API looks like this:
curl -XPOST 'localhost:9200/images/_search' -d '{
  "fields": [ "name" ],
  "query": {
    "image": {
      "image": {
        "feature": "CEDD",
        "image": "<Base64 encoded image binary>",
        "hash": "BIT_SAMPLING",
        "limit": 10
      }
    }
  }
}'
Don’t load the images from Elasticsearch unless you have to. In this example query we use the fields parameter to load only the names of the images. This is because the images are typically large, and it really slows down searches if you want Elasticsearch to deliver this data. You can read more about this and similar tuning options in the article Managing Elasticsearch fields when searching.
Strictly speaking, the fields and limit parameters are optional, but they are highly recommended for performance reasons.
Using the Transport Client
The transport client has no methods or types for building an image query, but luckily you can specify the query as a string in DSL syntax.
To continue with the same example as when we were indexing, we can augment the ImageDocument class with the following function:
def toQuery: String = {
  s"""
  {
    "image": {
      "image": {
        "feature": "CEDD",
        "image": "${new BASE64Encoder().encode(data).replaceAll("[\n\r]", "")}",
        "hash": "BIT_SAMPLING",
        "limit": 10
      }
    }
  }"""
}
And then we can perform the same search, as with the HTTP API, like this:
client.prepareSearch("images").setQuery(image.toQuery).addField("name").get()
Testing the Results
In order to test the search, I made a script that selects a random image from the set, searches for it, and lists the top 10 matching images and their scores. It is also available in the gist.
An example run looks like this:
Searching for images similar to: im10314.jpg
Found: [im10314.jpg] with score: [1.0]
Found: [im21075.jpg] with score: [0.0362315]
Found: [im4494.jpg] with score: [0.03478625]
Found: [im23103.jpg] with score: [0.0347418]
Found: [im13182.jpg] with score: [0.031561606]
Found: [im2384.jpg] with score: [0.030222585]
Found: [im18142.jpg] with score: [0.03011181]
Found: [im2031.jpg] with score: [0.028815333]
Found: [im7330.jpg] with score: [0.024395658]
Found: [im22767.jpg] with score: [0.02146201]
Worth noting is that the search finds the exact match, which gets the highest score of 1.0. And in ranking order, these are the images found:
Other than having slightly similar colors, the images are not very similar, but the scores are not very high either. There are two possible causes for this: either there are no similar images in the data set, or the choice of matching algorithm was a poor one for this data set. Delving into the specifics of the different matching algorithms is unfortunately out of scope this time, but we can try to find an image that has good matches within the data set by simply repeating the process with different images. After modifying the script to output the average score of the top three hits whenever it increases, we get the following output:
New leader: im10.jpg, score: 0.5457066
New leader: im10064.jpg, score: 0.55895704
New leader: im10069.jpg, score: 0.61380416
New leader: im10090.jpg, score: 0.7782447
New leader: im10622.jpg, score: 0.9452252
New leader: im10817.jpg, score: 0.9933184
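The leader tracking described above could be sketched roughly as follows. This is not the script from the gist; the names LeaderSketch, topThreeAverage and leaders are made up, and the sketch assumes each hit list is already sorted best-first and includes the query image itself.

```scala
object LeaderSketch {
  // Mean score of the three best hits; assumes the scores are sorted best-first.
  def topThreeAverage(scores: Seq[Double]): Double = {
    val top = scores.take(3)
    if (top.isEmpty) 0.0 else top.sum / top.size
  }

  // Walk over (query image, hit scores) pairs and keep every image
  // whose top-three average beats the best average seen so far.
  def leaders(results: Seq[(String, Seq[Double])]): Vector[(String, Double)] =
    results.foldLeft(Vector.empty[(String, Double)]) { case (found, (name, scores)) =>
      val avg = topThreeAverage(scores)
      if (found.isEmpty || avg > found.last._2) found :+ (name -> avg) else found
    }
}
```

As a sanity check, the three best scores listed for im10817.jpg are 1.0, 1.0 and 0.97995514, and their mean is roughly 0.9933184, which matches the last "New leader" line.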
The top three images look like this:
Clearly, it is easier to achieve high scores with black and white photos for this algorithm. For the sake of completeness, the results for the top-ranking image were:
Images similar to: im10817.jpg
Found: [im10817.jpg] with score: [1.0]
Found: [im3443.jpg] with score: [1.0]
Found: [im8545.jpg] with score: [0.97995514]
Found: [im1360.jpg] with score: [0.5906452]
Found: [im19056.jpg] with score: [0.3436884]
Found: [im9820.jpg] with score: [0.31441176]
Found: [im12341.jpg] with score: [0.30534726]
Found: [im6995.jpg] with score: [0.3012863]
Found: [im6038.jpg] with score: [0.2511092]
Found: [im5160.jpg] with score: [0.2422185]
Plane Images
We have seen some similarity, but honestly I’m not yet impressed. In this final attempt, I handpick five images of the same motif: an airplane against a blue sky. I add them to the same index and execute the same query to see if any of them are retrieved within the top ten.
Images similar to: 8401586753_91a218f95e_o.jpg
Found: [8401586753_91a218f95e_o.jpg] with score: [1.0]
Found: [im14728.jpg] with score: [0.07146341]
Found: [im2633.jpg] with score: [0.059009902]
Found: [im13879.jpg] with score: [0.05875]
Found: [im2643.jpg] with score: [0.04318436]
Found: [im12893.jpg] with score: [0.039066523]
Found: [im23867.jpg] with score: [0.037845306]
Found: [im14107.jpg] with score: [0.033956632]
Found: [im15524.jpg] with score: [0.032459892]
Found: [im21666.jpg] with score: [0.03232]

Images similar to: 14668291623_2cbb04aa1c_o.jpg
Found: [14668291623_2cbb04aa1c_o.jpg] with score: [1.0]
Found: [im12011.jpg] with score: [0.058475036]
Found: [im18417.jpg] with score: [0.058243487]
Found: [im23867.jpg] with score: [0.049384493]
Found: [im12893.jpg] with score: [0.046724197]
Found: [im4108.jpg] with score: [0.045620646]
Found: [im15983.jpg] with score: [0.04494405]
Found: [im14728.jpg] with score: [0.0403279]
Found: [im17209.jpg] with score: [0.03635637]
Found: [im4504.jpg] with score: [0.03228649]

Images similar to: 14264287401_e68c94ee85_o.jpg
Found: [14264287401_e68c94ee85_o.jpg] with score: [1.0]
Found: [im3471.jpg] with score: [0.058208954]
Found: [im15179.jpg] with score: [0.0427226]
Found: [im3716.jpg] with score: [0.04201087]
Found: [im22055.jpg] with score: [0.04192308]
Found: [im12893.jpg] with score: [0.040504586]
Found: [im14161.jpg] with score: [0.038947366]
Found: [im12820.jpg] with score: [0.038699917]
Found: [im4108.jpg] with score: [0.036421873]
Found: [im9682.jpg] with score: [0.031241378]

Images similar to: 11731068223_2bb5258d7a_o.jpg
Found: [11731068223_2bb5258d7a_o.jpg] with score: [1.0]
Found: [im16938.jpg] with score: [0.07487332]
Found: [im13637.jpg] with score: [0.06531328]
Found: [im17338.jpg] with score: [0.065182954]
Found: [im22588.jpg] with score: [0.057351522]
Found: [im11164.jpg] with score: [0.05494677]
Found: [im794.jpg] with score: [0.048871726]
Found: [im13775.jpg] with score: [0.045117185]
Found: [im5373.jpg] with score: [0.043741353]
Found: [im24607.jpg] with score: [0.042783484]

Images similar to: 3893372643_a10546bacb_o.jpg
Found: [3893372643_a10546bacb_o.jpg] with score: [1.0]
Found: [im11164.jpg] with score: [0.058427628]
Found: [im14188.jpg] with score: [0.051731724]
Found: [im16314.jpg] with score: [0.041284915]
Found: [im11603.jpg] with score: [0.03983957]
Found: [im22150.jpg] with score: [0.039609164]
Found: [im20188.jpg] with score: [0.039445836]
Found: [im10609.jpg] with score: [0.038841564]
Found: [im22588.jpg] with score: [0.034103587]
Found: [im22229.jpg] with score: [0.033843976]
The surprising result in this test is that none of the handpicked images matched one another (at least not within the top ten). Still, the results seem more similar than before, with a few odd ones mixed in. Many of the images found contain blue skies and a few of them even show an airplane. One explanation for this improvement could be that the handpicked images are larger, but the sample is too small to draw any firm conclusion.
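Checking whether the handpicked images find each other can be automated rather than done by eye. The following is a minimal sketch under the assumption that the hit lists have been collected into a map keyed by query image. The names CrossMatchSketch and crossMatches are made up for illustration.

```scala
object CrossMatchSketch {
  // For each query image, list the other query images that appear among its hits.
  // The query image itself (the exact match) is ignored.
  def crossMatches(hitsByQuery: Map[String, Seq[String]]): Map[String, Seq[String]] = {
    val handpicked = hitsByQuery.keySet
    hitsByQuery.map { case (query, hits) =>
      query -> hits.filter(hit => hit != query && handpicked.contains(hit))
    }
  }
}
```

For the five result lists above, every entry would come out empty, which is exactly the observation that none of the handpicked images matched one another.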
Conclusion
The elasticsearch-image plugin is the way to go if Elasticsearch is what you have and content-based image search is what you need. That said, it is still a little rough around the edges at times, but hopefully future releases will address that, as well as target newer Elasticsearch versions.
Although we only had time to test one algorithm this time, LIRE has a lot more to offer. Just as there are different kinds of images, such as photos, drawings or t-shirt patterns to name a few, it is not immediately obvious which algorithm is the best fit for a given use case.