15 August 2014

Getting Started with LIRE and Elasticsearch

By Konrad Beiske

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Search for Images Using Images

LIRE (Lucene Image REtrieval) is a plugin for Lucene to index and search images. A cool and quirky feature that sets it apart is that it does content-based retrieval, a fancy way of saying that you use images in your search query and it retrieves similar images from the index. In order to use LIRE with Elasticsearch, we need to make Elasticsearch aware of the new data type and the query that LIRE provides. Luckily, there is a plugin for Elasticsearch that does just that.

Installing

The LIRE plugin for Elasticsearch is called elasticsearch-image and is available on GitHub. It bundles LIRE, and all we have to do is install a version that matches our Elasticsearch version. Unfortunately, it currently lags a little behind the Elasticsearch releases, but as Elasticsearch is quick to adopt new Lucene versions, this is probably to be expected for plugins that tie in all the way down to the Lucene layer.

The latest release of the plugin is 1.2.0, and it was built against Elasticsearch 1.0.1. In this article we will be using Elasticsearch 1.0.3, because the only changes between those versions are bug fixes. The master branch is currently targeting Elasticsearch 1.1.0, but it is still a work in progress, as not all tests are passing yet.

On a Found Cluster

To install the plugin on a Found cluster, you need to upload it as a custom plugin. The easiest way to get hold of the plugin is through Maven Central. Just search for it, download the zip file corresponding to the desired version, and upload it through our console as described in our documentation.

On a Self-Hosted Cluster

If you host the cluster yourself you can simply use the plugin tool like this:

bin/plugin -install com.github.kzwang/elasticsearch-image/1.2.0

Mappings

In order to get started, we need to create an index with appropriate mappings. For this test the cluster only has one node, and we don't intend to grow it any further, so we use a single shard in the index. The more interesting part is the mappings:

    "image": {
        "properties": {
            "name": {
                "type": "string"
            },
            "image": {
                "type": "image",
                "feature": {
                    "CEDD": {
                        "hash": "BIT_SAMPLING"
                    },
                    "JCD": {
                        "hash": ["BIT_SAMPLING", "LSH"]
                    },
                    "FCTH": {}
                }
            }
        }
    }

We define a type called image with two properties: name, the file name of the image, and image, which will contain the actual image data. The former is a simple string type, like normal, but the latter has the type image. The image data type is defined by the plugin and is how Elasticsearch knows to utilize LIRE. The data type supports different kinds of features, and these in turn provide different ways of matching the query image against the indexed images when searching.

Features Supported

Currently the Elasticsearch plugin supports the following features:

  • AUTO_COLOR_CORRELOGRAM
  • BINARY_PATTERNS_PYRAMID
  • CEDD
  • SIMPLE_COLOR_HISTOGRAM
  • COLOR_LAYOUT
  • EDGE_HISTOGRAM
  • FCTH
  • GABOR
  • JCD
  • JOINT_HISTOGRAM
  • JPEG_COEFFICIENT_HISTOGRAM
  • LOCAL_BINARY_PATTERNS
  • LUMINANCE_LAYOUT
  • OPPONENT_HISTOGRAM
  • PHOG
  • ROTATION_INVARIANT_LOCAL_BINARY_PATTERNS
  • SCALABLE_COLOR
  • TAMURA

Detailing each of these techniques could easily become a comprehensive book on its own, so I will simply refer to the LIRE documentation.

Finally, we create the index like this:

curl -XPOST 'https://cluster_id-eu-west-1.foundcluster.com:9243/images' -d '{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "image": {
            "properties": {
                "name": {
                    "type": "string"
                },
                "image": {
                    "type": "image",
                    "feature": {
                        "CEDD": {
                            "hash": "BIT_SAMPLING"
                        },
                        "JCD": {
                            "hash": ["BIT_SAMPLING", "LSH"]
                        },
                        "FCTH": {}
                    }
                }
            }
        }
    }
}'

Indexing Images

Indexing a document containing image data is not much different from indexing any other document. The API is the same; the exception is that we put binary data in the image field. Since JSON does not support binary data as a native value, the plugin expects the data to be Base64-encoded as a JSON string. To distinguish these strings from normal strings, we also have to define the mapping for the image field explicitly.

Base64 encoding is a transformation that encodes three bytes of data into four printable characters. Most programming languages have utilities for this, and if not, it is usually not difficult to adapt an example found online.
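As an illustration of the three-bytes-to-four-characters rule, here is a minimal sketch using java.util.Base64, the standard encoder shipped with Java 8. Note that this encoder is my own choice for the sketch; the article's example below uses sun.misc.BASE64Encoder instead, which is why it has to strip newlines.

```scala
// Minimal sketch: Base64 encoding with java.util.Base64 (Java 8+).
// Three input bytes become four printable output characters, and this
// encoder inserts no line breaks, so the result can go straight into JSON.
import java.util.Base64

object Base64Sketch extends App {
  val raw: Array[Byte] = "Man".getBytes("UTF-8") // three bytes: M, a, n
  val encoded: String = Base64.getEncoder.encodeToString(raw)
  println(encoded) // four characters: TWFu
}
```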

For this article, I have written a small example in Scala that indexes the 25k images data set from Flickr. Using sun.misc.BASE64Encoder, it is pretty straightforward to create a JSON document for each image file. The only gotcha is removing the newlines the encoder inserts, like this:

import java.nio.file.Files
import java.nio.file.Path
import sun.misc.BASE64Encoder
case class ImageDocument(name: String, data: Array[Byte]) {
  def this(path: Path) = this(path.toFile().getName(), Files.readAllBytes(path))
  def toJson = {
    s"""
    {
        "name" : "$name",
        "image" :"${new BASE64Encoder().encode(data).replaceAll("[\n\r]", "")}"
    }"""
  }
}

To retrieve all the images in a certain path, we define the following function:

  // `path` refers to the directory containing the downloaded image files
  def getData(): Stream[ImageDocument] = {
    for {
      childPath <- Files.newDirectoryStream(path).asScala.toStream
      if childPath.toFile().getName().endsWith(".jpg")
    } yield new ImageDocument(childPath)
  }

We then push the stream of images to Elasticsearch in batches. First, we create a transport client for my Found cluster and wait for it to connect:

val settings = ImmutableSettings.settingsBuilder()
  .put("transport.type", "no.found.elasticsearch.transport.netty.FoundNettyTransportModule")
  .put("transport.found.api-key", "api-key")
  .put("cluster.name", "cluster_id")
  .put("client.transport.ignore_cluster_name", false)
  .build();
val client = {
  val address = new InetSocketTransportAddress("clusterid-region.foundcluster.com", 9343);
  ElasticClient.fromClient(new TransportClient(settings).addTransportAddress(address))
}
println("Waiting for clusterstatus yellow.")
val clusterStatus = client.client.admin().cluster().prepareHealth().setWaitForYellowStatus().execute().actionGet().getStatus().name()
println(s"Got clusterstatus $clusterStatus")

With the client ready, the actual indexing is done like this:

import ExecutionContext.Implicits.global
val bulkLoads = images.grouped(bulkSize).map {
  batch =>
    client.bulk(
      batch.map(image => index into "images/image" source new StringSource(image.toJson)): _*)
}.toStream
Future.sequence(bulkLoads).onComplete(result => {
  println("Closing client")
  client.close()
})
for (bulk <- bulkLoads) {
  bulk.onSuccess {
    case bulkResult => {
      println(s"Bulkresult ${bulkResult.getItems()}")
      for (result <- bulkResult.getItems()) {
        if (result.isFailed()) {
          println(s"Failed response: [${result.getFailure()}][${result.getFailureMessage()}]")
        } else {
          println("Bulk success")
        }
      }
    }
  }
}
val failureCount = (Future sequence bulkLoads.map(_.map(_.asScala.count(_.isFailed())))).map(_.sum)
failureCount.onSuccess {
  case count: Int => {
    println(s"Found ${images.size} images, $count failed to index")
  }
}

The complete example can be found in the gist.

Things to Consider While Indexing

Chances are, your images will be quite a bit larger than the documents you usually index. This implies that you will probably have to use considerably smaller batches while indexing. As a point of reference, I managed to index the entire 25k Flickr data set on a single-node Elasticsearch cluster with 2 GB memory (1 GB heap) and 70 images in each batch. It is also recommended to slow down the refresh interval while doing heavy indexing.
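One way to relax the refresh interval with the ES 1.x Java API is to disable periodic refreshes for the duration of the bulk load and restore them afterwards. A hedged sketch, assuming an already connected transport client like the one created earlier (the helper name and the 1s default restored at the end are my own choices):

```scala
// Sketch: relax the refresh interval around a heavy bulk load.
// Assumes an ES 1.x org.elasticsearch.client.Client that is already connected.
import org.elasticsearch.client.Client
import org.elasticsearch.common.settings.ImmutableSettings

def withSlowRefresh(client: Client, indexName: String)(bulkLoad: => Unit): Unit = {
  val indices = client.admin().indices()
  // "-1" disables periodic refreshes entirely while we index
  indices.prepareUpdateSettings(indexName)
    .setSettings(ImmutableSettings.settingsBuilder().put("index.refresh_interval", "-1"))
    .get()
  try bulkLoad
  finally {
    // restore the default interval and make the newly indexed docs searchable
    indices.prepareUpdateSettings(indexName)
      .setSettings(ImmutableSettings.settingsBuilder().put("index.refresh_interval", "1s"))
      .get()
    indices.prepareRefresh(indexName).get()
  }
}
```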

Searching

The elasticsearch-image plugin also provides a new query type, aptly named image. Using it with the REST API looks like this:

curl -XPOST 'localhost:9200/images/_search' -d '{
    "fields": [
        "name"
    ],
    "query": {
        "image": {
            "image": {
                "feature": "CEDD",
                "image": "<Base64 encoded image binary>",
                "hash": "BIT_SAMPLING",
                "limit": 10
            }
        }
    }
}'

Don’t load the images from Elasticsearch unless you have to. In this example query, we use the fields parameter to load only the names of the images. This is because the images are typically large, and it really slows down searches if Elasticsearch has to deliver this data as well. You can read more about this and similar tuning options in the article Managing Elasticsearch fields when searching.

Strictly speaking, the fields and limit parameters are optional, but they are highly recommended for performance reasons.

Using the Transport Client

With the transport client, there is the issue that the client does not have methods and types for building an image query, but luckily you can pass the query as a raw JSON string.

To continue with the same example as when we were indexing, we can augment the ImageDocument class with the following function:

def toQuery: String = {
  s"""
  {   
      "image": {
          "image": {
              "feature": "CEDD",
              "image": "${new BASE64Encoder().encode(data).replaceAll("[\n\r]", "")}",
              "hash": "BIT_SAMPLING",
              "limit": 10
          }
      }
  }"""
}

And then we can perform the same search, as with the HTTP API, like this:

client.prepareSearch("images").setQuery(image.toQuery).addField("name").get()

Testing the Results

In order to test the search, I made a script that selects a random image from the set, searches for it, and lists the top ten matching images and their scores. You can find it in the gist as well.

An example run looks like this:

Searching for images similar to: im10314.jpg
Found: [im10314.jpg] with score: [1.0]
Found: [im21075.jpg] with score: [0.0362315]
Found: [im4494.jpg] with score: [0.03478625]
Found: [im23103.jpg] with score: [0.0347418]
Found: [im13182.jpg] with score: [0.031561606]
Found: [im2384.jpg] with score: [0.030222585]
Found: [im18142.jpg] with score: [0.03011181]
Found: [im2031.jpg] with score: [0.028815333]
Found: [im7330.jpg] with score: [0.024395658]
Found: [im22767.jpg] with score: [0.02146201]

What is worth noting is that it is capable of detecting the exact match, which gets the highest score of 1.0. In ranking order, these are the images found:

im10314.jpg im21075.jpg im4494.jpg im23103.jpg im13182.jpg im2384.jpg im18142.jpg im7330.jpg im22767.jpg

Other than having slightly similar colors, the images are not very similar, but the scores are not very high either. There are two possible causes for this: either there are no similar images in the data set, or the matching algorithm was a poor choice for this data set. Delving into the specifics of the different matching algorithms is unfortunately out of scope this time, but we can try to choose an image that has good matches within the data set simply by repeating the process with different images. After modifying the script to output the average score of the top three hits whenever it increases, we get the following output:
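The leader score printed below appears to be the mean of the three highest hit scores with the query image's own exact match included: for im10817.jpg, the top three scores of 1.0, 1.0, and 0.97995514 average to about 0.99331838, consistent with the reported 0.9933184. A small sketch of that calculation (the helper name is mine, not from the original script):

```scala
// Sketch: the "leader" score is the average of the three highest hit scores.
// The query image itself is among the hits with a score of 1.0.
def topThreeAverage(scores: Seq[Float]): Float = {
  val top = scores.sorted(Ordering[Float].reverse).take(3)
  top.sum / top.size
}

// E.g. the im10817.jpg hit scores reproduce the reported leader score:
// topThreeAverage(Seq(1.0f, 1.0f, 0.97995514f, 0.5906452f)) ~= 0.9933184
```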

New leader: im10.jpg, score: 0.5457066
New leader: im10064.jpg, score: 0.55895704
New leader: im10069.jpg, score: 0.61380416
New leader: im10090.jpg, score: 0.7782447
New leader: im10622.jpg, score: 0.9452252
New leader: im10817.jpg, score: 0.9933184

The top three images look like this:

im10090.jpg im10622.jpg im10817.jpg

Clearly, it is easier to achieve high scores with black-and-white photos for this algorithm. For the sake of completeness, the results for the top-ranking image were:

Images similar to: im10817.jpg
Found: [im10817.jpg] with score: [1.0]
Found: [im3443.jpg] with score: [1.0]
Found: [im8545.jpg] with score: [0.97995514]
Found: [im1360.jpg] with score: [0.5906452]
Found: [im19056.jpg] with score: [0.3436884]
Found: [im9820.jpg] with score: [0.31441176]
Found: [im12341.jpg] with score: [0.30534726]
Found: [im6995.jpg] with score: [0.3012863]
Found: [im6038.jpg] with score: [0.2511092]
Found: [im5160.jpg] with score: [0.2422185]

im10817.jpg im3443.jpg im8545.jpg im1360.jpg im19056.jpg im9820.jpg im12341.jpg im6995.jpg im6038.jpg im5160.jpg

Plane Images

We have seen some similarity, but honestly, I'm not yet impressed. In this final attempt, I handpick five images of the same motif: an airplane against a blue sky. I add them to the same index and execute the same query for each of them to see if any of the others are retrieved within the top ten.

Images similar to: 8401586753_91a218f95e_o.jpg
Found: [8401586753_91a218f95e_o.jpg] with score: [1.0]
Found: [im14728.jpg] with score: [0.07146341]
Found: [im2633.jpg] with score: [0.059009902]
Found: [im13879.jpg] with score: [0.05875]
Found: [im2643.jpg] with score: [0.04318436]
Found: [im12893.jpg] with score: [0.039066523]
Found: [im23867.jpg] with score: [0.037845306]
Found: [im14107.jpg] with score: [0.033956632]
Found: [im15524.jpg] with score: [0.032459892]
Found: [im21666.jpg] with score: [0.03232]

8401586753_91a218f95e_o.jpg im14728.jpg im2633.jpg im13879.jpg im2643.jpg im12893.jpg im23867.jpg im14107.jpg im15524.jpg im21666.jpg

Images similar to: 14668291623_2cbb04aa1c_o.jpg
Found: [14668291623_2cbb04aa1c_o.jpg] with score: [1.0]
Found: [im12011.jpg] with score: [0.058475036]
Found: [im18417.jpg] with score: [0.058243487]
Found: [im23867.jpg] with score: [0.049384493]
Found: [im12893.jpg] with score: [0.046724197]
Found: [im4108.jpg] with score: [0.045620646]
Found: [im15983.jpg] with score: [0.04494405]
Found: [im14728.jpg] with score: [0.0403279]
Found: [im17209.jpg] with score: [0.03635637]
Found: [im4504.jpg] with score: [0.03228649]

14668291623_2cbb04aa1c_o.jpg im12011.jpg im18417.jpg im23867.jpg im12893.jpg im4108.jpg im15983.jpg im14728.jpg im17209.jpg im4504.jpg

Images similar to: 14264287401_e68c94ee85_o.jpg
Found: [14264287401_e68c94ee85_o.jpg] with score: [1.0]
Found: [im3471.jpg] with score: [0.058208954]
Found: [im15179.jpg] with score: [0.0427226]
Found: [im3716.jpg] with score: [0.04201087]
Found: [im22055.jpg] with score: [0.04192308]
Found: [im12893.jpg] with score: [0.040504586]
Found: [im14161.jpg] with score: [0.038947366]
Found: [im12820.jpg] with score: [0.038699917]
Found: [im4108.jpg] with score: [0.036421873]
Found: [im9682.jpg] with score: [0.031241378]

14264287401_e68c94ee85_o.jpg im3471.jpg im15179.jpg im3716.jpg im22055.jpg im12893.jpg im14161.jpg im12820.jpg im4108.jpg im9682.jpg

Images similar to: 11731068223_2bb5258d7a_o.jpg
Found: [11731068223_2bb5258d7a_o.jpg] with score: [1.0]
Found: [im16938.jpg] with score: [0.07487332]
Found: [im13637.jpg] with score: [0.06531328]
Found: [im17338.jpg] with score: [0.065182954]
Found: [im22588.jpg] with score: [0.057351522]
Found: [im11164.jpg] with score: [0.05494677]
Found: [im794.jpg] with score: [0.048871726]
Found: [im13775.jpg] with score: [0.045117185]
Found: [im5373.jpg] with score: [0.043741353]
Found: [im24607.jpg] with score: [0.042783484]

11731068223_2bb5258d7a_o.jpg im16938.jpg im13637.jpg im17338.jpg im22588.jpg im11164.jpg im794.jpg im13775.jpg im5373.jpg im24607.jpg

Images similar to: 3893372643_a10546bacb_o.jpg
Found: [3893372643_a10546bacb_o.jpg] with score: [1.0]
Found: [im11164.jpg] with score: [0.058427628]
Found: [im14188.jpg] with score: [0.051731724]
Found: [im16314.jpg] with score: [0.041284915]
Found: [im11603.jpg] with score: [0.03983957]
Found: [im22150.jpg] with score: [0.039609164]
Found: [im20188.jpg] with score: [0.039445836]
Found: [im10609.jpg] with score: [0.038841564]
Found: [im22588.jpg] with score: [0.034103587]
Found: [im22229.jpg] with score: [0.033843976]

3893372643_a10546bacb_o.jpg im11164.jpg im14188.jpg im16314.jpg im11603.jpg im22150.jpg im20188.jpg im10609.jpg im22588.jpg im22229.jpg

The surprising result in this test is that none of the handpicked images matched one another (among the top ten, at least), but there seem to be more similar results, along with a few odd ones. Many of the images found contain blue skies, and a few of them even show an airplane. One explanation for this improvement could be that the handpicked images are larger, but one cannot really draw that conclusion, as this is not a statistically significant sample.

Conclusion

The elasticsearch-image plugin is the way to go if Elasticsearch is what you've got and content-based image search is what you need. That said, it is still a little rough around the edges at times, but hopefully future releases will address that, as well as targeting newer Elasticsearch versions.

Although we only had time to test one algorithm this time, LIRE has a lot more to offer. Just as there are different kinds of images (photos, drawings, or t-shirt patterns, to name a few), it is not immediately obvious which algorithm is the best fit for a given use case.