Nothing but Net with Elastic Cloud
Introduction
Last year, I was tasked with implementing Elasticsearch for my company's spending tracker app, Swish. We were under the gun to get the app ready for an important demo that was one week away. Amazingly, I was able to introduce Elasticsearch into the app and get it working in production with just one day of work. This article will go over exactly how I was able to move so quickly and what steps you can take to do the same even if you don't know much about Elasticsearch.
The Problem
Like most budget apps, Swish connects to your bank accounts and starts tracking your spending. The app tries to give you insight into how you spend your money. Part of that includes categorizing your transactions by company. Here are some examples of what bank transaction descriptors look like:
- SUISHAYA RESTAURA BOSTON MA ON 03/01
- MASSACHUSETTS YOUTH SOC MA 2455923423243223
- MARKET BASKET 02/25 #232323223 PURCHASE MARKET
The problem we quickly ran into was that not all descriptors are created equal. For example, we would receive descriptors from the same company, but with varying symbols such as dates or random identifiers. There is (unfortunately for us) no standard pattern you can use to parse out the company name.
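To make the inconsistency concrete, here is a quick sketch of a naive regex cleanup (a hypothetical heuristic written for this article, not anything we shipped) and why it falls short:

```javascript
// Naive cleanup: strip dates and long numeric identifiers from a descriptor.
// Hypothetical heuristic for illustration -- not production code.
function naiveCompanyName(descriptor) {
  return descriptor
    .replace(/\b\d{2}\/\d{2}\b/g, '')  // dates like 03/01
    .replace(/#?\d{6,}\b/g, '')        // long numeric identifiers
    .replace(/\s+/g, ' ')              // collapse leftover whitespace
    .trim();
}

console.log(naiveCompanyName('MARKET BASKET 02/25 #232323223 PURCHASE MARKET'));
// 'MARKET BASKET PURCHASE MARKET' -- noise words like PURCHASE survive
console.log(naiveCompanyName('SUISHAYA RESTAURA BOSTON MA ON 03/01'));
// 'SUISHAYA RESTAURA BOSTON MA ON' -- city/state noise survives too
```

Stripping the obvious symbols still leaves city names, state codes, and filler words attached, and every new bank introduces a new format, which is why a fuzzier, score-based approach was needed.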
The Solution
From a high level, we knew that:
- We had transaction descriptors with inconsistent formats.
- We had a database of company names that we wanted to associate with these descriptors so we could provide better experiences, such as providing company information or logos.
- We needed to quickly implement a solution that would help us associate descriptors with companies and wouldn't break when a new descriptor format appeared.
I had a week to get a solution in place, but it actually only took me one day thanks to Elasticsearch.
Step 1 - Create a New Cluster
I first needed to get an Elasticsearch instance up and running. Elastic's Elasticsearch Service to the rescue (they offer a free trial)! Yes, this service costs slightly more than if you ran your own Elasticsearch cluster on EC2, but it is well worth it given that:
- You won't need to worry about managing complicated upgrades.
- Their Elastic Cloud admin console makes managing your cluster ridiculously easy.
- The Elastic Support team is top notch.
I started using this service over three years ago when it was just a small startup called Found.no. In my experience, Elastic Cloud has been great for everything from small projects with tight deadlines like this one to scaling a large Elasticsearch cluster.
Step 2 - Create a New Index
Once I set up the Elasticsearch cluster through Elastic Cloud, I had to create an index for our data. If you are not familiar with Elasticsearch, the term "index" may be confusing. Here is a good way to think about it if you are more used to relational databases:
- MySQL => Databases => Tables => Columns/Rows
- Elasticsearch => Indices => Types => Documents with Properties
Here is the command I used to create my new index for our company data:
curl --user username:password -XPUT 'https://elasticurlhere.us-east-1.aws.found.io:9243/company?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "companytype": {
      "properties": {
        "name": {
          "analyzer": "my_analyzer",
          "type": "text",
          "boost": 8
        },
        "aliases": {
          "analyzer": "my_analyzer",
          "type": "text",
          "boost": 1
        }
      }
    }
  }
}
'
A few things to keep in mind with the command above:
- If you don't know the username and password, you can reset them in Elastic Cloud under the Security section.
- The URL for your Elasticsearch instance can be found in the Overview section of Elastic Cloud.
- The keyword tokenizer and lowercase filter are used to try and make it easy to do keyword matching when we start running queries.
- The boost score is used to specify how important one field is relative to another. In this case, I am saying that the `name` field (with a boost score of 8) is much more important than the `aliases` field (with a boost score of 1).
- The mappings section defines how data is stored within Elasticsearch. This will affect the results that come back from any searches. Feel free to look at the documentation for mappings, but be forewarned that this is a pretty dense topic. My approach with Elasticsearch has always been to take someone else's example (like this one), look at the results, and then adjust the index definition as needed.
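As a rough mental model (and only that) of what the custom analyzer above does, the keyword tokenizer emits the entire input as a single token and the lowercase filter normalizes its case, something like:

```javascript
// Rough emulation of the index's custom analyzer: the keyword tokenizer keeps
// the whole string as one token, and the lowercase filter normalizes its case.
// A sketch for intuition only -- not how Elasticsearch is implemented.
function myAnalyzer(text) {
  return [text.toLowerCase()]; // one token for the entire input
}

console.log(myAnalyzer('Market Basket')); // [ 'market basket' ]
```

Because the whole company name becomes one lowercased token, matches are case-insensitive but still treat the name as a unit, which is what we wanted for keyword-style matching.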
Step 3 - Load Data from MongoDB
We had already been using MongoDB for a year when we decided to implement Elasticsearch, so we opted to run them in parallel. In general, there are two approaches for getting data from MongoDB into Elasticsearch:
- Dump-and-load - Get a data dump from the system of record and then bulk load it into Elasticsearch
- Replication stream - Set up an agent that replicates the system of record's transaction logs to Elasticsearch
The second approach is much more complicated and is really only needed when the data set is very large and/or must be continuously synchronized. In my case, our data set was relatively small (only about 20,000 documents) and was only updated a few times a week. Therefore, we could go with the following very simple dump-and-load approach:
- Use mongoexport to generate a file with all the company records from Mongo
mongoexport -h localhost -d swishdb -c company -f name,aliases -o companies.txt
- Modify the export file so that, right before each company document, there is an Elasticsearch bulk index action line:
{ index: { _index: 'company', _type: 'companytype', _id: company._id }}
// example node.js code to add Elasticsearch bulk data info to an export file
var fs = require('fs');
var path = require('path');

var inputFilePath = path.join(__dirname, 'companies.txt');
var outputFilePath = path.join(__dirname, 'bulkdata.txt');

var companyInput = fs.readFileSync(inputFilePath, 'utf8').split('\n');
var companyOutput = [];

companyInput.forEach(function (companyStr) {
  companyStr = (companyStr || '').trim();
  if (!companyStr) { return; }

  var company = JSON.parse(companyStr);

  // save the id before removing it; mongoexport emits ObjectIds as { "$oid": "..." }
  var companyId = (company._id && company._id.$oid) || company._id;
  delete company._id; // the document itself canโ€™t contain the _id field

  companyOutput.push(JSON.stringify({ index: { _index: 'company', _type: 'companytype', _id: companyId } }));
  companyOutput.push(JSON.stringify(company));
});

// the Bulk API requires the data to end with a newline
fs.writeFileSync(outputFilePath, companyOutput.join('\n') + '\n', 'utf8');
- Use the Bulk API to load the company data from the modified file into Elasticsearch
curl --user username:password -XPOST 'https://elasticurlhere.us-east-1.aws.found.io:9243/_bulk?pretty' -s -H 'Content-Type: application/x-ndjson' --data-binary '@bulkdata.txt'
This simple dump-and-load approach was easy to implement and helped us solve the problem quickly. In the future, however, when we do move to more of a real time replication solution, we will likely rely on Elasticsearch, Beats and Logstash.
Step 4 - Run Queries
As mentioned earlier, the ultimate goal is matching a bank transaction descriptor with a company name. To do so, we used a series of prefix searches in which we send different parts of the descriptor to Elasticsearch and try to get the highest score with the most characters. Each search query looks something like this, where "search text here" is replaced with different parts of the descriptor:
curl --user username:password -XGET 'https://elasticurlhere.us-east-1.aws.found.io:9243/company/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "min_score": 0.1,
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "search text here",
            "fields": [ "name", "aliases" ],
            "type": "best_fields"
          }
        },
        {
          "multi_match": {
            "query": "search text here",
            "fields": [ "name", "aliases" ],
            "type": "phrase_prefix"
          }
        }
      ]
    }
  }
}
'
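The "different parts of the descriptor" can be generated in many ways; here is one sketch (a hypothetical helper, assuming we simply try progressively shorter word prefixes of the descriptor):

```javascript
// Build candidate search strings from a descriptor by taking word prefixes of
// decreasing length. Each candidate would replace "search text here" in the
// query above. Hypothetical helper, for illustration only.
function candidateQueries(descriptor) {
  var words = descriptor.trim().split(/\s+/);
  var candidates = [];
  for (var n = words.length; n >= 1; n--) {
    candidates.push(words.slice(0, n).join(' '));
  }
  return candidates;
}

console.log(candidateQueries('MARKET BASKET 02/25'));
// [ 'MARKET BASKET 02/25', 'MARKET BASKET', 'MARKET' ]
```

We would then run the query once per candidate and keep the match with the highest score covering the most characters.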
Just like with type mappings, there are many different variations and options for Elasticsearch queries. This is yet another very dense topic, and again my recommendation is to start with an example like this one from the web and then tweak the query until the results match your expectations.
The Results
After spending one day implementing the basic indexing and search functionality, followed by another couple of days of tweaking the query parameters, we were able to get a 60% hit rate. That was an order of magnitude better than our previous 5% hit rate, and our team was pretty excited. In the months that followed we got it up to a 70% hit rate, which we were satisfied with given the somewhat unpredictable nature of bank transaction formats.
Looking back, Elasticsearch saved the day and allowed us to rock our demo. While this may seem like a simple change, it actually resulted in a much smoother user experience as we were able to hide confusing symbols and characters from descriptors by associating them with recognizable company names.
Final Thoughts
There is so much more to Elasticsearch than what this article covers, but my hope is that it helps those of you who have never used it before to finally jump in and try it out. Once you get the basic functionality I have outlined here working, you can start to level up your Elasticsearch skills over time.