July 12, 2017 User Stories

How AerData improved OCR search capability using Elasticsearch

By Maxim Alferov, Nikita Savinov and Peter Tabangan

About AerData

AerData, a Boeing company, provides integrated software solutions for lease management, engine fleet planning and records scanning, as well as technical and back office services for aircraft and engine operators, lessors and MROs (Maintenance, Repair and Overhaul). With a strong customer focus, AerData delivers a reliable and secure service to its clients using the latest technologies and state-of-the-art infrastructure.

In 2002, AerData developed STREAM (Secure Technical Records for Electronic Asset Management), to deliver a solution for the management of hard copy technical records. This allows an airline, lessor or MRO, to scan, index and centralize all its records for sharing both within and outside its organization.

This is achieved via a DMS (Document Management System) built on a Microsoft SQL Server database, which provides fast searching and indexing of text and associated metadata. Search can then be customized to the client's specification via a powerful OCR (Optical Character Recognition) search tool.

Challenges

This solution worked satisfactorily for many years, but as we acquired more customers and amassed more data, the database became increasingly difficult to maintain. To stay ahead of the market and still meet our SLAs (Service Level Agreements), additional resources were required to improve performance and functionality. We quickly realized that our current solution was no longer fit for purpose, for the following reasons:

    1. SQL Server Full-Text Search has no (or only limited) ability to:
         • Rank the results
         • Generate suggestions (e.g. "search-as-you-type")
         • Scale out easily
    2. It is difficult to combine multiple filters. Very often users want to search for a combination of terms (e.g. aircraft serial number, date and some specific keyword in the documents). The more filters we have, the more joins we need, and with millions of records the query becomes unacceptably slow (from several seconds up to 3-5 minutes).
    3. Due to the large amount of data, making backups and restoring them is a time-consuming process.

Specification

Firstly, we put together a list of essential requirements as a baseline for choosing the right tools for the job. Below are some of the OCR project requirements:

      1. Ability to perform full and partial text search with a combination of metadata
      2. Multi-language support (e.g. Chinese)
      3. Can easily integrate into our existing application
      4. Can store and update complex objects
      5. Handles multiple tenants
      6. Maintainable infrastructure and deployment without downtime
      7. Scalable

Considering these requirements, and bearing in mind our very successful logging project using Elasticsearch with Kibana, Elasticsearch seemed the logical route to follow.

Next, we defined the minimum amount of metadata that we needed to store in Elasticsearch, then we created a field mapping that balanced search performance against disk space. As the project progressed, we kept adding metadata definitions. For these ongoing schema changes, we used the Elasticsearch dynamic template feature to version our schema/field mapping, which made it easy to track changes.
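To make the idea of a versioned mapping with dynamic templates more concrete, here is a minimal sketch. It is not our actual schema: the field names (serial_number, scan_date, ocr_text), the template name and the use of the mapping's `_meta` section to hold a version number are all assumptions made for illustration. The body is written in the typeless form used by recent Elasticsearch versions; older versions nest it under a mapping type name.

```python
def build_mapping(schema_version: int) -> dict:
    """Return a mapping body that carries its own version number."""
    return {
        # `_meta` is free-form metadata stored alongside the mapping.
        # Keeping a schema version there is an assumption of this sketch.
        "_meta": {"schema_version": schema_version},
        "dynamic_templates": [
            {
                # Hypothetical rule: any new string field that is not
                # mapped explicitly is indexed as an exact-match keyword.
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ],
        # Explicitly mapped fields (hypothetical names).
        "properties": {
            "serial_number": {"type": "keyword"},
            "scan_date": {"type": "date"},
            "ocr_text": {"type": "text"},
        },
    }
```

Bumping schema_version whenever a field is added gives every index a record of which mapping revision it was created with.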

As part of the search requirements, we created a custom tokenizer to search for full and partial keywords. This gave us the ability to perform whole word searches and search on a subset of characters (e.g. finding the word "air" in "airline").
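One common way to get this kind of partial matching is an n-gram tokenizer, which indexes every short substring of a word. The sketch below assumes that approach; the tokenizer and analyzer names and the gram sizes are illustrative, not our production settings. The small helper function only mimics what the tokenizer emits so that the effect is easy to see.

```python
from typing import List

# Index settings defining a custom analyzer around an n-gram tokenizer.
# "partial_tokenizer" and "partial_analyzer" are hypothetical names.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "partial_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 4,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "partial_analyzer": {
                    "type": "custom",
                    "tokenizer": "partial_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}


def ngrams(word: str, min_gram: int = 3, max_gram: int = 4) -> List[str]:
    """Return every substring of `word` whose length is between
    min_gram and max_gram, lowercased, in left-to-right order."""
    word = word.lower()
    return [
        word[start:start + size]
        for start in range(len(word))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(word)
    ]
```

With these sizes, ngrams("airline") contains "air", so a search for "air" can match a document that only contains "airline". The trade-off is a larger index, because every word is stored as many small tokens.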

Data migration

At the beginning of the project, we had amassed a lot of data which needed to be indexed in Elasticsearch (millions and millions of records over multiple customers). The easiest approach was to take records individually and index them, but this simply wouldn't work as it took several days to index an average customer database. Thankfully Elasticsearch offered some performance tuning which immediately improved the performance of the migration process.
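A typical first step away from indexing records one by one is to batch them through the bulk API. The following is a sketch of that idea under stated assumptions: the index name, the batch size and the "id" key on each record are hypothetical, and the action line is written for recent Elasticsearch versions (older versions also expect a `_type`).

```python
import json
from typing import Dict, Iterable, Iterator, List


def chunked(records: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield successive batches of at most `size` records."""
    batch: List[Dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(index: str, records: Iterable[Dict]) -> str:
    """Build the newline-delimited JSON payload for one bulk request:
    an action line followed by the document source, per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index, "_id": record["id"]}}))
        lines.append(json.dumps(record))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"
```

Each string returned by bulk_body would be sent as the body of one bulk request, so a batch of 1,000 records costs one round trip instead of 1,000.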

Data synchronization

After the initial indexing, we had to make sure that our data was up-to-date and synchronized. To achieve this, we implemented the following solution:

      1. Each data change in our primary storage is saved as an "event"
      2. Each event is published into a message queue
      3. A service listens for those events and, when they arrive, updates the data in Elasticsearch

There are two types of events:

  • Single-entity update events (when only one document is updated in Elasticsearch). For these events we use the _update API
  • Multi-entity update events. For these we use the _update_by_query API
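The two event types map onto two different request shapes. The sketch below builds the path and body for each; the function names, index name and field names are hypothetical, and the paths and the script's "source" key follow recent Elasticsearch versions (older versions include a type in the update path and use a different script key).

```python
from typing import Any, Dict, Tuple


def single_entity_update(index: str, doc_id: str,
                         changes: Dict[str, Any]) -> Tuple[str, Dict]:
    """Partial update of one document: only the fields listed in
    `changes` are merged into the stored document."""
    return "/%s/_update/%s" % (index, doc_id), {"doc": changes}


def multi_entity_update(index: str, match_field: str, match_value: Any,
                        field: str, value: Any) -> Tuple[str, Dict]:
    """Update every document matching a term query by running a
    small script against each of them."""
    body = {
        "query": {"term": {match_field: match_value}},
        "script": {
            "lang": "painless",
            # Values are passed as params instead of being
            # concatenated into the script text.
            "source": "ctx._source[params.field] = params.value",
            "params": {"field": field, "value": value},
        },
    }
    return "/%s/_update_by_query" % index, body
```

For example, renaming an aircraft operator would be a multi-entity event: one request updates every scanned record that belongs to that aircraft.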

Failed events were retried at least three times with exponential backoff, which made our system resilient to potential errors and outages.
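Retrying with exponential backoff can be sketched in a few lines. This is an illustration rather than our service code: the function name, the base delay of one second and the injectable sleep parameter (which makes the function easy to test) are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(action: Callable[[], T],
                       max_retries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `action`; on failure wait base_delay, then 2x, then 4x...
    and try again, up to `max_retries` retries after the first attempt."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                # Out of retries: let the caller see the last error.
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Doubling the delay gives a cluster that is restarting or briefly overloaded time to recover, instead of hitting it again immediately with the same request.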

Querying

As part of the business requirements, we needed a query that combined multiple metadata with full text search.
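In Elasticsearch terms, such a query is usually a bool query: the full-text part goes into "must" so that it contributes to ranking, and the metadata conditions go into "filter", where they only include or exclude documents. The builder below is a sketch of that pattern written in Python for brevity; the field names (ocr_text, scan_date) and the function itself are hypothetical.

```python
from typing import Any, Dict, Optional


def build_search(keyword: str,
                 filters: Dict[str, Any],
                 date_from: Optional[str] = None,
                 date_to: Optional[str] = None) -> Dict:
    """Combine a full-text match on the OCR text with exact-match
    metadata filters and an optional date range."""
    filter_clauses = [{"term": {field: value}} for field, value in filters.items()]
    if date_from or date_to:
        date_range = {}
        if date_from:
            date_range["gte"] = date_from
        if date_to:
            date_range["lte"] = date_to
        filter_clauses.append({"range": {"scan_date": date_range}})
    return {
        "query": {
            "bool": {
                # Scored: decides how results are ranked.
                "must": [{"match": {"ocr_text": keyword}}],
                # Not scored: only narrows the result set.
                "filter": filter_clauses,
            }
        }
    }
```

Adding another filter is one more entry in the list rather than one more join, which is what made combined searches slow in the relational design.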

Although Elasticsearch provides a rich and well-documented REST API, we needed to run a lot of dynamically generated queries. Elastic provides client libraries for many platforms to help with that, and since our application is written in .NET we used NEST (Elastic's official .NET client).

This library has a great query builder which hides the complexities of Elasticsearch JSON (JavaScript Object Notation) queries. It helps us construct complex queries and gives us a great abstraction over Elasticsearch.

The only thing we were missing here was documentation: NEST has tutorials, but they cover only 30-40% of the functionality, which meant diving into the NEST source code from time to time.

Integration with existing application

To integrate the OCR application with our existing application, we built a custom API on top of Elasticsearch. Our custom API contains the business logic to process search requests and search responses. This approach avoids exposing the Elasticsearch API to the outside world, which mitigates security risks.

Infrastructure of OCR project

One of our goals was to make sure that we could easily use Elasticsearch in production with no downtime, minimizing manual intervention and other operational costs. To achieve this, we needed to automate the process as much as possible and also make sure that Elasticsearch supported Puppet and the IaC (Infrastructure as Code) approach we have adopted. Luckily, Elasticsearch has great support for it.

Elastic publishes Puppet modules on Puppet Forge, the marketplace for Puppet modules (https://forge.puppet.com/elastic), and there are a lot of third-party modules as well. These modules can automate configuration management, Elasticsearch instance creation, restarts and so on. Based on Puppet and these modules, we created a continuous deployment scheme so that:

      1. All infrastructure code is hosted in git
      2. Build is done in Visual Studio Online
      3. Changes are replicated to the cluster by Puppet

Success story

Looking back, the project was certainly challenging, but proved to be a great achievement for the team. We now have a robust and fully customizable system which means we can now easily fulfill any complicated query requirements as well as scale up/out or even switch off the machines when we don't need them.

For the OCR project, we used a typical set of environments, namely development, staging and production. The timescale for configuration changes is now less than 5 minutes on development and test environments and 30 minutes on production.

We also plan to incorporate automatic module testing and integration testing into our IaC development pipeline, covering infrastructure and configuration changes. This will make us more robust against the rapid pace of Elastic Stack development as its functionality continuously improves.


Contributors

Peter Tabangan

Software Engineer at AerData, working as a backend developer. Enjoys learning new technology and using it to its full capacity. Also interested in machine learning and IoT, and enjoys watching TV series in his free time.

Nikita Savinov

Software Engineer at AerData, working as a full stack developer. Enjoys solving complex problems with easy solutions.

Maxim Alferov

DevOps Engineer at AerData, responsible for developing automation for the management of our SaaS environment. Enjoys automating infrastructure by turning it into code with modern configuration management tools.