February 2, 2017 User Stories

Scaling File System Search with Elasticsearch @ Egnyte

By Kalpesh Patel

In this article, I want to share the journey of transforming our search functionality from a home grown distributed search solution to Elasticsearch, and how that switch helped to reduce the amount of manpower required to maintain and scale the system that powers search over billions of files and petabytes of content all while saving us money in return.

Our Use Case: Searching Files and its Content Stored by Customers

Egnyte powers smart content collaboration and governance in the cloud and on-premise. One of our most popular products is Egnyte Connect which powers secure file storage and sharing (in both cloud and on-premise environments) for tens of thousands of customers including Nasdaq, Buzzfeed, RedBull, AppDynamics, Yamaha, and more. In each of these cases, the customer stores millions of files and when employees are looking for a specific one or related topic, the search functionality they would use is powered by our Elasticsearch clusters.

egnyte-1.png

The initial architecture was home grown and it was an array of 100+ machines, called “indexers”, each hosting metadata translation/full text extraction/indexing/search code and a Lucene index. A group of customers were pinned to an indexer node by the routing engine and system was prone to many issues like:

  1. Hotspots : Two or more big customers would land on to same indexer causing search performance to be non-deterministic.
  2. No Replication/HA/Failover : each indexer would store the lucene index but there was no copy and if machine goes down we had to reindex while taking a downtime.
  3. SPOF : If indexer goes down due to write load then all customers on the shard would be affected even for read queries.
  4. Manual rebalancing : Hotspots were eliminated manually by resharding and copying data from one machine to another.
  5. Inferior Scaling : Adding new nodes would require downtime for rebalancing and index moves.

A constant capacity of DevOps was dedicated to baby-sit indexers and move data accordingly. Unsurprisingly, a big percentage of customer complaints would be related to search quality and performance. And we found ourselves saying ‘hey, you can’t expect customers to find a file in a haystack without search.’

The Solution: Separation of Concerns by Migrating to Elasticsearch

We were looking for a solution that had:

  1. Replication support
  2. Elastic scaling
  3. Auto rebalancing
  4. No SPOF
  5. REST API interface
  6. Good documentation
  7. Active community support
  8. Cluster monitoring tools

Solr and Elasticsearch both fit the requirements, I had used Solr for a side project previously. It was a good search solution but making it distributed was configuration heavy and a lot of scripting was needed to make autoscaling work. Conversely, to autoscale in Elasticsearch, you just pick the proper number of primary shards and replicas in your index and add more nodes and the cluster will auto rebalance. We ended up choosing Elasticsearch as its API was much cleaner and intuitive and it was built with a distributed mindset from the outset.

We separated the concern of scaling/monitoring/managing the search dataset to Elasticsearch and trimmed down “indexer” code to contain business logic of in-order event processing, metadata translation, content extraction. Plus, since we no longer store the lucene index, we added HA to indexers. We removed huge amount of home grown indexing code and instantly became more scalable. In total, we run close to 80 elasticsearch nodes and 30 indexer nodes and we are easily able to add more nodes on demand.

Elasticsearch takes care of auto scaling/rebalancing/replication by adding more nodes/indexes and as a result, we now only have just 1 devops and 1 engineer taking care of entire search stack instead of the constant crew that was needed in the past. This has resulted in significant cost savings. More importantly, though, it has reduced burnout among the team as they can now work on what they love and let Elasticsearch automatedly take care of most of the scaling stuff.

Technical Deep Dive

As users uploads files and folders and search them, we use the Apache Tika library to extract full text content out of the files and index that along with file metadata (user, size, filename, folder path, etc.). At the core, there are 2 main models in schema:

  1. Folder : This is a denormalized model containing full folder path and some attributes, this was added mainly to make folder searches faster.
  2. File version: File and folder level attributes are denormalized at “File version” model for search performance.

At a high level, when a file or folder is added/mutated we log an event into event store from Cloud File Server(CFS) and publish a message to RabbitMQ which wakes up indexers to index the events in order into elasticsearch. At 10,000 ft, the flow looks like:

egnyte-2.png

Challenges

  1. Sizing cluster and indexes : We have a mixed bag of customers, lots of SMBs with <1M files and lot of medium to big customers with 10-100M files. As we were indexing full text content the index size was exploding, so we decided to index first 1MB of extracted text but even then, the estimated size, with replication, was hovering around 100+ TB. To reduce the impact of cluster downtime we decided to spin 4 clusters and within each cluster, we created 100 or more indexes using “Index Templates” with 10 shards each. We kept a mapping database in our routing engine and assigned a customer to 1 index in one of the clusters. This design allows us to scale by adding more indexes to existing clusters or just simply add new clusters.
  2. Deleted documents : Our entire dataset is active with customers being able to mutate the same files and folders all at the same time. As a result, when a customer moves a folder that has 1M files to somewhere else within the hierarchy, we have to immediately update 1M documents in Elasticsearch in order to reflect the new path on all those files. When a customer moves a folder with 1M files in subfolder hierarchy to the trash, we need to update 1M documents in Elasticsearch to mark them all as deleted. As a result, we accumulate lots of deleted documents in our cluster. For example, one of the clusters has 4 billion active documents and 1 billion deleted documents. We haven't had a chance to take a crack at this issue, so for now we just keep spare disk capacity in the cluster in order to handle these.
  3. Split brain: We once encountered split brain in cluster as we weren’t using dedicated master nodes. After this downtime we added 3 dedicated master nodes.
  4. Heap : We were running into heap issues with Elasticsearch 1.7 and we found Fielddata cache was consuming most of the heap. We changed the datatype of field to use Doc values and reindexed the entire cluster data to fix this issue.

Infrastructure Footprint

We deliberately made the decision to spin up our Elasticsearch clusters in Google Cloud Engine and indexers in our data center index to Google cloud. This allowed us to do frequent iteration over hardware spec generation. We started with 16CPU, 64G and 1T nodes, then moved to 16CPU, 64G and 2T nodes and recently we moved to 10CPU, 64G and 3T nodes as Google told us that we can save money by reducing CPU. By being in the public cloud and with the Elasticsearch architecture allowing us to treat nodes like cattle, we can easily provision a new node and have it join the cluster and let it replicate the data and tear down old generation nodes. Currently we have:

  1. 4 clusters
  2. 3 client and 3 master nodes in each cluster
  3. 65+ data nodes across 4 clusters
  4. 20 text extraction nodes
  5. 30 indexing nodes
  6. Puppet is used to deploy the entire stack.

Future of Elasticsearch at Egnyte

Egnyte is increasing its product footprint and moving into data governance. Providing better recommendations, content classification and using machine learning are on the roadmap. We are doing several Proof of concept implementations using Elasticsearch in some of these endeavors.

Learnings

  1. Design for reindexing: If you can reindex then you are in good hands.
  2. Use delayed reallocation: In public cloud like GCE a network issue can occur and node can get lost and join again, you don’t want to reallocation to trigger immediately.
  3. Use replicas: Having more than 1 replica allows you to sleep peacefully and changes your approach from “Pets to Cattle”.
  4. Monitor: You can’t fix what you can’t monitor. I am just waiting for Marvel to reach production so we can trend on the stats.

egnyte-3.png
Kalpesh Patel is a Distinguished Engineer at Egnyte. He is one of the founding engineer and owns Infrastructure components like Distributed Filesystem, Object storage, Search and APM at Egnyte.