01 May 2018 User Stories

Nothing but Net with Elastic Cloud

By Jeff Whelpley

Introduction

Last year, I was tasked with implementing Elasticsearch for my company’s spending tracker app, Swish. We were under the gun to get the app ready for an important demo that was one week away. Amazingly, I was able to introduce Elasticsearch into the app and get it working in production with just one day of work. This article will go over exactly how I was able to move so quickly and what steps you can take to do the same even if you don’t know much about Elasticsearch.

The Problem

Like most budget apps, Swish connects to your bank accounts and starts tracking your spending. The app tries to give you insight into how you spend your money. Part of that includes categorizing your transactions by company. Here are some examples of what bank transaction descriptors look like:

  • SUISHAYA RESTAURA BOSTON MA ON 03/01
  • MASSACHUSETTS YOUTH SOC MA 2455923423243223
  • MARKET BASKET 02/25 #232323223 PURCHASE MARKET

The problem we quickly ran into was that not all descriptors are created equal. For example, we would receive different descriptors for the same company, varying in extra tokens such as dates or random identifiers. There is (unfortunately for us) no standard pattern you can use to parse out the company name.

The Solution

From a high level, we knew that:

  • We had transaction descriptors with inconsistent formats.
  • We had a database of company names that we wanted to associate with these descriptors so we could provide better experiences, such as providing company information or logos.
  • We needed to quickly implement a solution that would help us associate descriptors with companies and wouldn’t break when a new descriptor format appeared.

I had a week to get a solution in place, but it actually only took me one day thanks to Elasticsearch.

Step 1 - Create a New Cluster

I first needed to get an Elasticsearch instance up and running. Elastic Cloud to the rescue! Yes, this service costs slightly more than if you ran your own Elasticsearch cluster on EC2, but it is well worth it given that:

  1. You won’t need to worry about managing complicated upgrades.
  2. Their Elastic Cloud admin console makes managing your cluster ridiculously easy.
  3. The Elastic Support team is top notch.

I started using this service over three years ago when it was just a small startup called Found.no. In my experience, Elastic Cloud has been great for any situation, from small projects with tight deadlines like this one to scaling a large Elasticsearch cluster.

Step 2 - Create a New Index

Once I set up the Elasticsearch cluster through Elastic Cloud, I had to create an index for our data. If you are not familiar with Elasticsearch, the term “index” may be confusing. Here is a good way to think about this if you are more used to relational databases:

  • MySQL => Databases => Tables => Columns/Rows
  • Elasticsearch => Indices => Types => Documents with Properties

Here is the command I used to create my new index for our company data:

curl --user username:password -XPUT 'https://elasticurlhere.us-east-1.aws.found.io:9243/company?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "companytype": {
      "properties": {
        "name": {
          "analyzer": "my_analyzer",
          "type": "text",
          "boost": 8
        },
        "aliases": {
          "analyzer": "my_analyzer",
          "type": "text",
          "boost": 1
        }
      }
    }
  }
}
'

A few things to keep in mind with the command above:

  • If you don’t know the username and password, you can reset it in Elastic Cloud under the Security section.
  • The URL for your Elasticsearch instance can be found in the Overview section of Elastic Cloud.
  • The keyword tokenizer and lowercase filter are used to try and make it easy to do keyword matching when we start running queries.
  • The boost score is used to specify how important one field is relative to another. In this case, I am saying that the `name` field (with a boost score of 8) is much more important than the `aliases` field (with a boost score of 1).
  • The mappings section defines how data is stored within Elasticsearch, which affects the results that come back from any searches. Feel free to look at the documentation for mappings, but be forewarned that this is a pretty dense topic. My approach with Elasticsearch has always been to take someone else’s example (like this one), see the results, and then start making adjustments to the index definition as needed.
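To make the effect of that analyzer concrete, here is a rough JavaScript sketch of what it does to a field value (this only mimics the behavior; it is not how Elasticsearch implements it):

```javascript
// Rough sketch of the custom analyzer's behavior: the keyword tokenizer
// emits the entire input as a single token, and the lowercase filter
// then normalizes its case.
function myAnalyzer(text) {
  return [text.toLowerCase()];  // one lowercased token for the whole value
}

console.log(myAnalyzer('Market Basket'));  // → [ 'market basket' ]
```

This is why “MARKET BASKET” and “market basket” can match each other, whereas the standard analyzer would instead split the value into separate word tokens.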

Step 3 - Load Data from MongoDB

We had already been using MongoDB for a year when we first decided to implement Elasticsearch, so we opted to run them in parallel. In general, there are two approaches for getting data from MongoDB into Elasticsearch:

  1. Dump-and-load - Get a data dump from the system of record and then bulk load it into Elasticsearch
  2. Replication stream - Set up an agent that replicates the system of record’s transaction logs to Elasticsearch

The second approach is much more complicated and is really only needed when the data set is very large and/or data must be continuously synchronized. In my case, however, our data set was relatively small (i.e. only 20,000 documents) and it was only updated a few times a week. Therefore, we could go with the following very simple dump-and-load approach:

  • Use mongoexport to generate a file with all the company records from Mongo
    mongoexport -h localhost -d swishdb -c company -f name,aliases -o companies.txt
  • Modify the export file so that each company document is preceded by an Elasticsearch bulk index action line (i.e. { index: { _index: 'company', _type: 'companytype', _id: company._id }} added to the file right before each document).
// example node.js code to add Elasticsearch bulk data info to an export file
var fs = require('fs');
var path = require('path');
var inputFilePath = path.join(__dirname, 'companies.txt');
var outputFilePath = path.join(__dirname, 'bulkdata.txt');
var companyInput = fs.readFileSync(inputFilePath, 'utf8').split('\n');
var companyOutput = [];
companyInput.forEach(function (companyStr) {
  companyStr = (companyStr || '').trim();
  if (!companyStr) { return; }
  var company = JSON.parse(companyStr);

  // mongoexport writes ObjectIds in extended JSON (i.e. { "$oid": "..." }),
  // so grab the raw id string before removing the field from the document
  var companyId = company._id && company._id.$oid ? company._id.$oid : company._id;
  delete company._id;  // the document itself can’t have the _id field in it

  companyOutput.push(JSON.stringify({
    index: { _index: 'company', _type: 'companytype', _id: companyId }
  }));
  companyOutput.push(JSON.stringify(company));
});
// the Bulk API requires the payload to end with a newline
fs.writeFileSync(outputFilePath, companyOutput.join('\n') + '\n', 'utf8');
  • Use the Bulk API to load the company data from the modified file into Elasticsearch

curl --user username:password -XPOST 'https://elasticurlhere.us-east-1.aws.found.io:9243/_bulk?pretty' -s -H 'Content-Type: application/x-ndjson' --data-binary '@bulkdata.txt'

This simple dump-and-load approach was easy to implement and helped us solve the problem quickly. In the future, however, when we do move to more of a real time replication solution, we will likely rely on Elasticsearch, Beats and Logstash.
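For reference, the resulting bulkdata.txt pairs each action line with its document on the following line. The id and aliases below are made-up examples:

```json
{ "index": { "_index": "company", "_type": "companytype", "_id": "5a0b1c2d3e4f5a6b7c8d9e0f" } }
{ "name": "Market Basket", "aliases": ["MARKET BASKET", "MKT BASKET"] }
```

Note that the Bulk API expects newline-delimited JSON (hence the x-ndjson content type in the curl command above) and the payload must end with a newline character.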

Step 4 - Run Queries

As mentioned earlier, the ultimate goal is matching a bank transaction descriptor with a company name. To do so, we used a series of prefix searches where we send different parts of the descriptor to Elasticsearch and try to get the highest score with the most characters. Each search query looks something like this, where “search text here” is replaced with different parts of the descriptor:

curl --user username:password -XGET 'https://elasticurlhere.us-east-1.aws.found.io:9243/company/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "min_score": 0.1,
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "search text here",
            "fields": [
              "name",
              "aliases"
            ],
            "type": "best_fields"
          }
        },
        {
          "multi_match": {
            "query": "search text here",
            "fields": [
              "name",
              "aliases"
            ],
            "type": "phrase_prefix"
          }
        }
      ]
    }
  }
}
'

Just like with type mappings, there are many different variations and options for Elasticsearch queries. This is yet another very dense topic and again my recommendation is to just start with some example like this that you find on the web and then start tweaking the query until the results match your expectations.
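To give a feel for the “series of prefix searches” described above, here is a hypothetical sketch of how you might generate the candidate query strings from a descriptor. The helper name and approach are my illustration, not the actual Swish code:

```javascript
// Hypothetical sketch: generate candidate query strings from a descriptor,
// longest word-prefix first, so the caller can run each one against
// Elasticsearch and keep the result with the best score over the most
// characters.
function candidateQueries(descriptor) {
  var words = descriptor.trim().split(/\s+/);
  var candidates = [];
  for (var i = words.length; i > 0; i--) {
    candidates.push(words.slice(0, i).join(' '));
  }
  return candidates;
}

console.log(candidateQueries('SUISHAYA RESTAURA BOSTON MA'));
// → [ 'SUISHAYA RESTAURA BOSTON MA', 'SUISHAYA RESTAURA BOSTON',
//     'SUISHAYA RESTAURA', 'SUISHAYA' ]
```

Starting from the longest prefix biases the match toward the most specific company name before falling back to shorter, more ambiguous fragments.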

The Results

After spending one day implementing the basic indexing and search functionality followed by another couple days of tweaking the query parameters, we were able to get a 60% hit rate. That was an order-of-magnitude improvement over the roughly 5% hit rate we had before, and our team was pretty excited. In the months that followed we were able to get it up to a 70% hit rate, which we were satisfied with given the somewhat unpredictable nature of bank transaction formats.

Looking back, Elasticsearch saved the day and allowed us to rock our demo. While this may seem like a simple change, it actually resulted in a much smoother user experience as we were able to hide confusing symbols and characters from descriptors by associating them with recognizable company names.

Final Thoughts

There is so much more to Elasticsearch than what this article covers, but my hope is that it helps those of you who have never used it before finally jump in and try it out. Once you get the basic functionality I have outlined here working, you can start to level up your Elasticsearch skills over time.