02 May 2017 User Stories

Text Classification made easy with Elasticsearch

By Saskia Vola

Elasticsearch is widely used as a search and analytics engine. Its capabilities as a text mining API are not as well known.

In the following article I'd like to show how text classification can be done with Elasticsearch. With a background in computational linguistics and several years of business experience as a freelancer in the field of text mining, I got the chance to implement and test the following techniques in different scenarios.

When I first stumbled across Elasticsearch, I was fascinated by its ease of use, speed and configuration options. Every time I worked with it, I found an even simpler way to achieve what I used to solve with traditional Natural Language Processing (NLP) tools and techniques.

At some point I realized that it can solve a lot of things out of the box that I was trained to implement from scratch.

Most NLP tasks start with a standard preprocessing pipeline:

  1. Gathering the data
  2. Extracting raw text
  3. Sentence splitting
  4. Tokenization
  5. Normalizing (stemming, lemmatization)
  6. Stopword removal
  7. Part of Speech tagging

Some NLP tasks such as syntactic parsing require deep linguistic analysis.

For these kinds of tasks Elasticsearch doesn't provide the ideal architecture and data format out of the box. That is, for tasks that go beyond the token level, custom plugins that access the full text need to be written or used.

But tasks such as classification, clustering, keyword extraction, measuring similarity etc. only require a normalized and possibly weighted Bag of Words representation of a given document.

Steps 1 and 2 can be solved with the Ingest Attachment Processor Plugin (before 5.0: the Mapper Attachments Plugin) in Elasticsearch.

Raw text extraction for these plugins is based on Apache Tika, which works on the most common data formats (HTML/PDF/Word etc.).
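As a sketch, an ingest pipeline using the attachment processor could be defined and used like this (Elasticsearch 5.x syntax; the pipeline name, index name, and field name are illustrative, and `<base64-encoded file>` is a placeholder for the actual encoded document):

```json
PUT _ingest/pipeline/attachment
{
  "description": "Extract raw text from binary documents with Apache Tika",
  "processors": [
    { "attachment": { "field": "data" } }
  ]
}

PUT sample/document/1?pipeline=attachment
{
  "data": "<base64-encoded file>"
}
```

The processor extracts the raw text from the base64-encoded content into an `attachment.content` field, which can then be analyzed like any other text field.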

Steps 4 to 6 are solved with the language analyzers out of the box.



If the mapping type for a given field is "text" (before 5.0: "analyzed string") and the analyzer is set to one of the languages natively supported by Elasticsearch, tokenization, stemming and stopword removal will be performed automatically at index time.

So no custom code and no other tool is required to get from any kind of document supported by Apache Tika to a Bag of Words representation.
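For example, a minimal mapping that activates the English analyzer at index time might look like this (index and type names are illustrative, Elasticsearch 5.x syntax):

```json
PUT sample
{
  "mappings": {
    "document": {
      "properties": {
        "content": { "type": "text", "analyzer": "english" }
      }
    }
  }
}
```

Every document indexed into the `content` field will then be tokenized, stemmed and stopword-filtered automatically.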

The language analyzers can also be called via the REST API while Elasticsearch is running:

curl -XGET "http://localhost:9200/_analyze?analyzer=english" -d'
{
  "text" : "This is a test."
}'

The non-Elasticsearch approach looks like this:

Gathering the text with custom code, parsing documents by hand or with the Tika library, and then running a traditional NLP library or API such as NLTK, OpenNLP, Stanford NLP, spaCy, or one of the many tools developed in research departments. However, tools from research departments are usually not very useful in an enterprise context: the data formats are often proprietary, the tools need to be compiled and executed on the command line, and the results are frequently just piped to standard out. REST APIs are the exception.

With the Elasticsearch language analyzers, on the other hand, you only need to configure your mapping and index the data. The pre-processing happens automatically at index time.

Traditional approach to text classification

Text classification is a task traditionally solved with supervised machine learning. The input to train a model is a set of labelled documents. The minimal representation of this would be a JSON document with 2 fields:

"content" and "category"

Traditionally, text classification is solved with a tool like scikit-learn, Weka, NLTK, or Apache Mahout.

Creating the models

Most machine learning algorithms require a vector space model representation of the data. The feature space is usually something like the 10,000 most important words of a given dataset. How can the importance of a word be measured?

Usually with TF-IDF, a weighting scheme that dates back to the 1970s. TF-IDF scores a term within a given document relative to the rest of the dataset. If a term has a high TF-IDF score in a document, it is a very characteristic keyword that distinguishes the document from all other documents by means of that word.
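In its classic form the score is the term frequency multiplied by the log of the inverse document frequency (many variants exist); a minimal sketch in plain Python, with a toy corpus of pre-tokenized documents:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Score `term` in one document relative to a corpus (classic tf * idf variant)."""
    tf = doc_tokens.count(term)                       # raw term frequency in the document
    df = sum(1 for d in corpus if term in d)          # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0   # rarer terms score higher
    return tf * idf

corpus = [
    ["granny", "smith", "apple"],
    ["apple", "iphone", "mac"],
    ["rose", "tree"],
]
# "apple" appears in 2 of 3 documents, so its idf is low;
# "iphone" appears in only 1, so it is the more distinctive keyword.
print(tf_idf("iphone", corpus[1], corpus) > tf_idf("apple", corpus[1], corpus))  # True
```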

The keywords with the highest TF-IDF scores in a subset of documents can represent a topic. For text classification a feature space with the n words with the highest overall TF-IDF scores is quite common.

Each document is converted to a feature vector, and a model is created from all training instances of each class/category. New documents can then be classified against this model: the document is converted to a feature vector, its similarity to each class is computed, and it is labelled with the highest-scoring category.
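To make that concrete, here is a minimal sketch of the idea in plain Python, using raw token counts as features and a nearest-centroid "model" with cosine similarity (the vocabulary, centroids and categories are illustrative, not the algorithm any particular library uses):

```python
import math
from collections import Counter

def to_vector(tokens, vocabulary):
    """Map a token list onto a fixed feature space (here: raw counts)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "model": one vector per category, built from training tokens.
vocabulary = ["apple", "fruit", "tree", "iphone", "software", "mac"]
centroids = {
    "Apple (Fruit)":   to_vector(["apple", "fruit", "tree"], vocabulary),
    "Apple (Company)": to_vector(["apple", "iphone", "software", "mac"], vocabulary),
}

def classify(tokens):
    """Label a new document with the most similar category."""
    vec = to_vector(tokens, vocabulary)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

print(classify(["apple", "tree", "fruit"]))  # Apple (Fruit)
```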

Text Classification with Elasticsearch

All the above can be solved in a much simpler way with Elasticsearch (or Lucene).

You just need to execute 4 steps:

  1. Configure your mapping ("content" : "text", "category" : "keyword")
  2. Index your documents
  3. Run a More Like This Query (MLT Query)
  4. Write a small script that aggregates the hits of that query by score
PUT sample

POST sample/document/_mapping
{
  "properties": {
    "content": { "type": "text", "analyzer": "english" },
    "category": { "type": "keyword" }
  }
}

POST sample/document/1
{
  "category": "Apple (Fruit)",
  "content": "Granny Smith, Royal Gala, Golden Delicious and Pink Lady are just a few of the thousands of different kinds of apple that are grown around the world! You can make dried apple rings at home - ask an adult to help you take out the core, thinly slice the apple and bake the rings in the oven at a low heat."
}

POST sample/document/2
{
  "category": "Apple (Company)",
  "content": "Apple is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. Its hardware products include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, the iPod portable media player, the Apple Watch smartwatch, and the Apple TV digital media player. Apple's consumer software includes the macOS and iOS operating systems, the iTunes media player, the Safari web browser, and the iLife and iWork creativity and productivity suites. Its online services include the iTunes Store, the iOS App Store and Mac App Store, Apple Music, and iCloud."
}

The MLT query is a very important query for text mining.

How does it work? It can process arbitrary text, extract the top n keywords relative to the actual "model" and run a boolean match query with those keywords. This query is often used to gather similar documents.

If all documents have a class/category label and a similar number of training instances per class, this is equivalent to classification. Just run an MLT query with the input document in the like field and write a small script that aggregates the score and category of the top n hits.

GET sample/document/_search
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like": "The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple. It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus. The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found today. Apples have been grown for thousands of years in Asia and Europe, and were brought to North America by European colonists. Apples have religious and mythological significance in many cultures, including Norse, Greek and European Christian traditions.",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

This sample is only intended to illustrate the workflow. For real classification you will need a bit more data, so don't worry if this example doesn't return an actual result. Just add more data and it will work.

And here's a little Python script that processes the response and returns the most likely category for the input document.

from operator import itemgetter

def get_best_category(response):
    """Aggregate the MLT hit scores per category and return the top category."""
    categories = {}
    for hit in response['hits']['hits']:
        score = hit['_score']
        category = hit['_source']['category']
        if category not in categories:
            categories[category] = score
        else:
            categories[category] += score
    if not categories:
        return None
    # Sort categories by accumulated score, highest first
    sorted_categories = sorted(categories.items(), key=itemgetter(1), reverse=True)
    return sorted_categories[0][0]

And there is your Elasticsearch text classifier!

Use cases

Classification of text is a very common real-world use case for NLP. Think of e-commerce data (products): lots of people run e-commerce shops with affiliate links. The data is provided by several shops and often comes with category tags, but each shop uses a different category system. So the category systems need to be unified, and all the data has to be re-classified according to the new category tree. Or think of a Business Intelligence application where company websites need to be classified by sector (hairdresser vs. bakery etc.).


I evaluated this approach with a standard text classification dataset: the 20 Newsgroups dataset. The highest precision (92% correct labels) was achieved with a high quality-score threshold that included only 12% of the documents. When labelling all documents (100% recall), 72% of the predictions were correct.
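The thresholding itself is simple; a minimal sketch (the categories, scores and threshold below are illustrative, not the actual evaluation data):

```python
def label_above_threshold(predictions, min_score):
    """Keep only predictions whose top-hit score clears the threshold.

    `predictions` is a list of (category, score) pairs, one per input document;
    documents below the threshold stay unlabelled (None), trading recall for precision.
    """
    return [cat if score >= min_score else None for cat, score in predictions]

preds = [("sci.space", 14.2), ("rec.autos", 3.1), ("sci.med", 9.8)]
print(label_above_threshold(preds, 9.0))  # ['sci.space', None, 'sci.med']
```

Raising `min_score` labels fewer documents but with higher precision; lowering it moves you toward the 100%-recall end of the tradeoff.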

The best algorithms for text classification on the 20 Newsgroups dataset are usually SVM and Naive Bayes. They have a higher average accuracy on the entire dataset.

So why should you consider using Elasticsearch for classification if there are better algorithms?

There are a few practical reasons. Training an SVM model takes a lot of time; especially in a startup, or when you need to adapt quickly to different customers or use cases, that can become a real problem. You may not be able to retrain your model every time your data changes. I experienced this myself while working on a project for a big German bank: you end up working with outdated models, and those will certainly no longer score as well.

With the Elasticsearch approach, training happens at index time and your model can be updated dynamically at any point, with zero downtime for your application. If your data is stored in Elasticsearch anyway, you don't need any additional infrastructure. Since more than 10% of the results are labelled with high accuracy, you can usually fill the first page of results, and in many applications that's enough for a good first impression.

Why then use Elasticsearch when there are other tools?

Because your data is already there and it's going to pre-compute the underlying statistics anyway. It's almost like you get some NLP for free!

Saskia Vola

Saskia Vola studied Computational Linguistics at the University of Heidelberg and started working in the field of text mining in 2009. After a few years in the Berlin startup scene she decided to become a fulltime freelancer and enjoy life as a Digital Nomad. As she received more and more relevant project offers, she decided to start a platform for NLP/AI freelancers called textminers.io