Tech Topics

Quick Tips: Negative Connotation Filter

When you help people build applications with Elasticsearch every day, you run into a lot of unique requirements and scenarios. All of us here at Elasticsearch have an ever-growing “bag of tricks”, and it only seems fair to share those tricks with you. We hope you find them interesting, and perhaps useful to your application.

For a recent project, I was working with short documents that represented movie reviews. The goal was a fairly common machine learning task: sentiment analysis and classification. If you’ve tried to build a sentiment analyzer before, you know that it can be tricky: the sentiment of a document depends not just on the words used, but on the context in which they are used.

For example, "He is happy" is clearly a “positive” sentiment. But add a single word and the meaning flips entirely: "He is not happy"

There are many ways to attack this problem, such as shingles and phrase matching. For some classifiers, however, providing additional explicit information about the “context” of a word can be very useful. Ideally, we’d like to tag a word and say “this word was in a negative context”.
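
For comparison, here is a rough sketch of what a shingle-based approach might look like. The index and filter names are illustrative (not from the project described here); the shingle token filter emits word pairs like "not happy" as single tokens, which preserves some context at the cost of a much larger vocabulary:

curl -XPUT localhost:9200/shingle_demo/ -d '
{
    "settings" : {
        "analysis" : {
            "filter" : {
                "bigrams" : {
                    "type" : "shingle",
                    "min_shingle_size" : 2,
                    "max_shingle_size" : 2
                }
            },
            "analyzer" : {
                "shingle_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase", "bigrams"]
                }
            }
        }
    }
}'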

With the help of @clintongormley‘s super-human regular expression expertise, I built a very simple “negative connotation” analyzer for Elasticsearch. Let’s take a look first, then I’ll explain the key points:

curl -XPUT localhost:9200/my_data/ -d '
{
    "settings" : {
        "analysis" : {
            "char_filter" : {
                "pre_negs" : {
                    "type" : "pattern_replace",
                    "pattern" : "(\w+)\s+((?i:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint))\b",
                    "replacement" : "~$1 $2"
                },
                "post_negs" : {
                    "type" : "pattern_replace",
                    "pattern" : "\b((?i:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint))\s+(\w+)",
                    "replacement" : "$1 ~$2"
                }
            },
            "analyzer" : {
                "negcon_tagger" : {
                    "type" : "custom",
                    "tokenizer" : "whitespace",
                    "filter" : ["lowercase", "kstem"],
                    "char_filter" : ["pre_negs", "post_negs"]
                }
            }
        }
    }
}'

The heart of this trick is the pair of char_filters. Character filters are under-appreciated but extremely useful in certain circumstances: they provide a way to modify the raw input text before any tokenization or analysis takes place.

In this example, we are building a custom analyzer which applies two char_filters, both of the pattern_replace variety. This filter performs a standard Java-regex find-and-replace on the input text.

The regex patterns in this example find a “negative” word (like “not”, “haven’t”, etc.) and tag the words immediately preceding and following it with a tilde: pre_negs handles the preceding word, post_negs the following one. (The backslashes in \\w, \\s, and \\b are doubled because the patterns live inside a JSON string.) One important note: since char_filters run before any token filters, the lowercase filter has not been applied yet, so you should make your patterns case-insensitive with the inline (?i:) flag.
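
For instance, the patterns should fire on uppercased input too. Running the analyzer over a shouted review (the same Analyze API call shown further below) should still produce ~was and ~happy, because at char_filter time the text has not yet passed through the lowercase filter:

curl -XGET 'localhost:9200/my_data/_analyze?analyzer=negcon_tagger&pretty' -d "The boy was NOT happy"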

Essentially, this converts "The boy was not happy" into "The boy ~was not ~happy". Because “happy” has been explicitly tagged with a tilde, it becomes a distinct token that can only be matched by searching for “~happy”, which distinguishes it from the regular “happy” token. This was important to the classifier I was using (naive Bayes) and helped improve classification accuracy.
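
To make that concrete, here is a sketch of how the tagged tokens could be searched. The review type and text field are illustrative (the real mapping from the project isn’t shown here); a term query is used deliberately because it skips analysis and matches the stored token verbatim:

curl -XPUT localhost:9200/my_data/review/_mapping -d '
{
    "review" : {
        "properties" : {
            "text" : {
                "type" : "string",
                "analyzer" : "negcon_tagger"
            }
        }
    }
}'

curl -XGET 'localhost:9200/my_data/review/_search?pretty' -d '
{
    "query" : {
        "term" : { "text" : "~happy" }
    }
}'

A document containing "not happy" will match this query, while a document containing a plain, untagged "happy" will not.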

Just to verify, you can check the tokenization output using the Analyze API:

curl -XGET 'localhost:9200/my_data/_analyze?analyzer=negcon_tagger&pretty' -d "The boy was not happy"

{
  "tokens" : [ {
    "token" : "the",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "boy",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "~was",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "not",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "~happy",
    "start_offset" : 16,
    "end_offset" : 21,
    "type" : "word",
    "position" : 5
  } ]
}

Of course, this is a very simple implementation. For example, we may not want to tag stopwords like “was” with a negative-connotation flag. Furthermore, the method interferes with stopword removal: a tagged token like “~was” no longer matches anything in a stop-list. In a future post, I’ll show a native token filter which produces the same output but is more robust.

I hope this post was interesting, and perhaps useful to your application. We’ll endeavor to keep publishing little tricks like this; Elasticsearch has many little nooks and crannies that can be used to build very powerful tools if you know where to look.