With over 17 years of software engineering experience, Shawn Anderson is currently working with the Architecture Development team at C.H. Robinson. His primary focus is exploring emerging technologies and methodologies, and assisting teams at C.H. Robinson in applying them. Outside of work, he enjoys spending time with his two daughters, running, movies, and gaming of all types.
As one of the world's largest third-party logistics (3PL) providers, C.H. Robinson (CHR) deals with a lot of data. We provide freight transportation and logistics, outsource solutions, produce sourcing, and information services to over 46,000 customers through a network of offices in North America, South America, Europe, and Asia. To meet our customers' freight needs, we provide access to over 66,000 transportation providers worldwide.
Accessing Shipment Data
In 2014, CHR handled approximately 14.3 million shipments. The transactional data supporting those shipments is represented by multiple terabytes of SQL data. To make the best decisions for our customers, our network needs access to this information as quickly as possible. As shipment volume increases and customer search requirements mature, we run into roadblocks with traditional SQL queries. For example:
- Having to limit search results to achieve acceptable performance
- Limitations with text searching when not using SQL Full-Text Search features
- Changing SQL indexes for performance on the fly (and in worst-case scenarios, causing other queries to no longer perform)
Elasticsearch has helped C.H. Robinson get past these roadblocks and opened new doors to providing an enjoyable user search experience.
Dipping Our Toes in Elasticsearch
Our first challenge was getting our SQL data into Elasticsearch. Fortunately, we had recently completed a service bus initiative - it was the perfect fit for building a pipeline of data into Elasticsearch. As data changes come in from different sources, notifications are sent out to be picked up by load processes. These processes gather the relevant SQL data for searching and forward it on to Elasticsearch. This handled our day-to-day operations, giving us "near real-time" data for searching.
The second problem to solve was bulk loading SQL data into Elasticsearch as a backfill. We explored Rivers - both JDBC and RabbitMQ (the backbone of our service bus). But with the deprecation of Rivers and some of the problems they presented, we opted to build a custom bulk load process so we could control the load on both sides of the pipeline. Initially, we used a distributed, multi-threaded approach, but quickly ran into bulk rejections. We increased the bulk threadpool queue size slightly (to 200) and throttled our bulk inserts to a more manageable level. This solved our initial data load problem and gave us a push-button "rebuild this index" option when needed.
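The throttled batching idea can be sketched in a few lines. This is an illustrative version, not our production loader - the function name, `batch_size`, and `delay` values are hypothetical, and the output follows the bulk API's action/source line format:

```python
import itertools
import time

def throttled_bulk_batches(docs, index, batch_size=500, delay=0.0):
    """Yield Elasticsearch bulk-API action/source pairs in throttled batches.

    The point is to send modest batches and pause between them so the
    cluster's bulk threadpool queue never fills up and starts rejecting
    requests. All parameter values here are illustrative.
    """
    it = iter(docs)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        actions = []
        for doc in batch:
            # Bulk API format: an action line, then the document source.
            actions.append({"index": {"_index": index, "_id": doc["id"]}})
            actions.append({k: v for k, v in doc.items() if k != "id"})
        yield actions
        if delay:
            time.sleep(delay)  # throttle so neither side of the pipeline is overwhelmed
```

Each yielded batch can then be POSTed to the `_bulk` endpoint, with the delay tuned until rejections disappear.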
With all this data now funneling into Elasticsearch, we were free to create much richer user search interfaces, including:
- Single input box style searching
- Type ahead auto complete functionality (geographical cities, customer lists, etc.)
- Highly performant grid style search results in milliseconds
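As a sketch of the type-ahead case, a simple prefix-style query body can be built like this. The `match_phrase_prefix` query is a quick way to get suggestions; production setups (ours included) may instead use edge n-gram analyzers or the completion suggester. The field name is hypothetical:

```python
def typeahead_query(field, prefix, size=10):
    """Build a simple type-ahead query body (a sketch, not our exact query).

    match_phrase_prefix matches documents whose field starts with the
    user's partial input - good enough to illustrate auto-complete.
    """
    return {
        "size": size,
        "query": {
            "match_phrase_prefix": {
                field: {"query": prefix}
            }
        }
    }
```

The body would be sent to the index's `_search` endpoint on every keystroke, with `size` kept small for snappy suggestions.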
The future of Elasticsearch @ C.H. Robinson
One of the biggest challenges when moving to Elasticsearch is understanding user search expectations and explaining what Elasticsearch relevancy means to your stakeholders. Users don't want to read theories about TF/IDF; they just want to know why certain results are or are not being returned. This has been a challenge, because we have so many search use cases in different parts of the organization. However, the initial reaction from our users to the new search capabilities has been great, and we've already been able to iterate on feedback to improve our queries. Some improvements we've made include:
- Adding boosting to certain fields
- Use of multiple clauses and boosting
- Multi-fields to expand search capabilities
- Using not_analyzed for sorting & aggregations, plus a keyword analyzer/lowercase tokenizer and standard analyzer combination for searching
- not_analyzed for sorting is ideal from a fielddata perspective, but has some challenges regarding case-sensitivity. However, that problem is being worked on, and likely to be addressed in a future release!
- Cleaning up what is included in _all (i.e. removing date time stamps, and other non-relevant data for _all searches such as types and enums)
- Up-front query parsing to "guess" the users’ intentions
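A few of the improvements above can be sketched together in 1.x-era mapping and query bodies. Everything here is illustrative - the field names (`customer_name`, `notes`, `updated_at`) and the boost value are hypothetical, not taken from our production mappings:

```python
def shipment_mapping():
    """Sketch of a mapping showing two of the tweaks above: a multi-field
    with a not_analyzed `raw` sub-field for sorting/aggregations, and
    keeping noisy fields (like timestamps) out of _all."""
    return {
        "properties": {
            "customer_name": {
                "type": "string",  # analyzed for full-text search
                "fields": {
                    # not_analyzed copy for sorting & aggregations
                    "raw": {"type": "string", "index": "not_analyzed"}
                }
            },
            # date stamps add no value to _all searches, so exclude them
            "updated_at": {"type": "date", "include_in_all": False},
        }
    }

def shipment_search(text):
    """Sketch of a multi-clause bool query with field boosting:
    a hit on customer_name counts more than a hit on free-text notes."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"customer_name": {"query": text, "boost": 3}}},
                    {"match": {"notes": text}},
                ]
            }
        },
        # sort on the not_analyzed sub-field, not the analyzed one
        "sort": [{"customer_name.raw": "asc"}],
    }
```

The same pattern extends naturally: more `should` clauses for more fields, with boosts tuned as user feedback comes in.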
It's clear we're only seeing the tip of the iceberg in terms of potential use cases and problems that Elasticsearch might help solve. Partnering with Elastic has given us a great starting point, assisting us in building out a search cluster that meets our initial use cases and data needs, with room for growth. If Elasticsearch were a video game, it would definitely follow Bushnell's Law - easy to learn, hard to master. As we look to the future, we hope to utilize many more features of Elasticsearch such as:
- Geospatial relevancy
- Index aliases for rolling transactional data and easier index management
- Watcher for alerting - "Notify me anytime a shipment in my region, with the properties that I care about, enters the system."
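The index alias idea, for example, boils down to a small `_aliases` actions body that atomically points a stable alias at the next time-based index. A minimal sketch, with hypothetical alias and index names:

```python
def roll_alias(alias, old_index, new_index):
    """Sketch of an _aliases request body that rolls a transactional
    alias from one period's index to the next in a single atomic call,
    so searchers never see a gap. Names are illustrative."""
    return {
        "actions": [
            {"remove": {"alias": alias, "index": old_index}},
            {"add": {"alias": alias, "index": new_index}},
        ]
    }
```

Applications keep querying the alias while indexes rotate underneath it, which also makes retiring old transactional data a matter of dropping whole indexes.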
The views expressed on this website/weblog are mine alone and do not necessarily reflect the views of my employer. I am not authorized to make statements on behalf of C.H. Robinson.