November 12, 2013

Building Enterprise Content Management Systems on Elasticsearch

Today we’re bringing you another story of Elasticsearch’s use in the field: Vodori and Pepper, their home-grown enterprise content management system based on Alfresco and Elasticsearch. Vodori is a full-service digital agency that blends Strategy, Design, Technology, and Product to deliver high-velocity digital global marketing solutions. They are based in Chicago, Illinois, US, and they’re the gracious hosts of the Chicago Elasticsearch Meetup. Grant Gochnauer, Vice President and Co-Founder of Vodori, was kind enough to share their story with us.

Two and a half years ago, Vodori set out to redesign its enterprise content management system, Pepper, to deliver high-velocity digital marketing to our customers. With a global scale and complex content management and distribution requirements, Pepper had to be flexible and scalable and, of course, provide rich search capabilities. In 2010, we evaluated technologies that would enable us to meet our product goals and decided on two core technologies: Alfresco and Elasticsearch.

Why Elasticsearch?

Before diving into our decision to use Elasticsearch, it helps to understand a bit more about some of our requirements. Content management systems base their data storage on an “object model” or “content model” allowing developers to specify how content types are defined and relate to one another in a system. For example, we might have an object called “Document” that contains properties such as “title” and “filename”. We can then extend “Document” with a child object in our object model – a “WebDocument”. It has additional properties such as “seoDescription” and “url”. “WebDocument” inherits the properties defined on “Document”.
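In code, that inheritance might look like the following minimal sketch. It uses Python dataclasses purely for illustration (Pepper's own content model is written in Java), the property names are adapted to Python naming, and the sample values are made up.

```python
from dataclasses import dataclass, fields


@dataclass
class Document:
    title: str
    filename: str


@dataclass
class WebDocument(Document):
    # Extra properties defined only on the child type.
    seo_description: str
    url: str


page = WebDocument(
    title="Pricing",
    filename="pricing.html",
    seo_description="Plans and pricing",
    url="/pricing",
)

# The child type exposes the inherited properties first, then its own.
property_names = [f.name for f in fields(page)]
print(property_names)
```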

Pepper provides an object model as part of the core product but allows (and encourages) implementation teams to extend this model depending on the business requirements for the solution being delivered. Alfresco provides its own “object model” out of the box and we extend this concept significantly for Pepper in our searching paradigm. Because project implementations can contain widely varied content types and properties, we have to support the myriad ways our customers may want to search and filter content.

While Alfresco implements CMIS, the industry standard for interacting with content data (creating, updating, searching), the performance did not meet Pepper’s specific needs. Our challenge: how to get content out of Alfresco (and by extension Pepper) in such a way that:

Implementation teams could write complex queries based on client-specific business requirements.
The Pepper team could exercise finer control over our searches, including sorting, faceting, and paging.
We could loosely couple Pepper’s user interface and business logic to the underlying data stores (Alfresco and other relational stores), enabling more elegant evolution of each tier.

Enter Elasticsearch. After evaluating a number of technology options, we selected Elasticsearch because it met our requirements and also because of the trust we had in the people and product, gained through the successful use of Shay Banon’s previous technology, Compass. By introducing this facade layer on top of Alfresco, we gained not only a high-performance, scalable way to interact with our data but also complete control over how we search content.

Let’s look at an example: Pepper Library

One of the many ways users can find and interact with content is through the Pepper Library.

We allow customers to toggle a number of parametric filters as the results appear through an endless scroll list. Elasticsearch provides everything we need to implement this user experience out of the box.

When users toggle filters along the left panel, we simply construct the appropriate Elasticsearch query using the TermsFilter combined with a BoolFilter, and Elasticsearch immediately provides the data we need.

We also provide free-text search through the search box in the upper right corner of the Library. Thankfully this is also extremely easy to do with Elasticsearch, even when searching with the left panel filters applied. If a user types in a search query, we simply append a Match Query to the existing BoolFilter in our previous query. Elasticsearch happily returns what we asked for. We can further customize the way free-text search works by implementing custom Index and Search analyzers for each property.
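To make the shape of such a request concrete, here is a minimal sketch of a request body that combines the toggled filters with an optional free-text query. It is not Pepper's actual code: the function name, the field names (contentType, language) and the use of the _all field are assumptions, and it relies on the filtered query wrapper available in Elasticsearch releases of this era (later releases express the same thing differently).

```python
import json


def build_library_query(selected_filters, free_text=None, offset=0, page_size=20):
    """Build a search body for a filtered, paged library listing.

    selected_filters maps a field name to the list of values toggled on for it.
    """
    # One terms filter per toggled field; the bool filter requires all of them.
    must = [
        {"terms": {field: values}}
        for field, values in sorted(selected_filters.items())
        if values
    ]
    # Without free text, every document is a candidate and only filters narrow it.
    query = {"match": {"_all": free_text}} if free_text else {"match_all": {}}

    filtered = {"query": query}
    if must:
        filtered["filter"] = {"bool": {"must": must}}

    # "from" and "size" drive paging, e.g. for an endless scroll list.
    return {"from": offset, "size": page_size, "query": {"filtered": filtered}}


body = build_library_query(
    {"contentType": ["pdf"], "language": ["en", "fr"]},
    free_text="annual report",
)
print(json.dumps(body, indent=2))
```

Fetching the next page of the scroll list is then a matter of calling the same function with a larger offset.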

Modelling Content in Elasticsearch

One of the challenges we faced when building our architecture was making it easy for product engineering to create content models without worrying about the details of Elasticsearch. Similarly, we wanted to provide a layer of abstraction so that our implementation teams could focus on the client solution and not on the intricacies of Elasticsearch mapping and index creation.

We built a framework that lets us annotate our Java content object model and generates the Elasticsearch mappings from those annotations, hiding the complexity from our engineers. Our framework provides the contract between our product APIs and Elasticsearch.

In this example, “logicalName” represents the filename of a document exposed to the end-user, which is different from the system-generated physical filename we store in Alfresco. With our @ElasticSearchProperty annotation, we can specify the following:

type: This is the field type that Elasticsearch uses to define how this value is indexed.
filterable: Whether we want to be able to use Elasticsearch Filters on this value. When set to true, the values are not analyzed.
freeTextSearchable: Whether we want to index the analyzed value for free text searching.
searchAnalyzer: The custom analyzer we use when searching.
indexAnalyzer: The custom analyzer we use when indexing the document.

When using a combination of “filterable” and “freeTextSearchable” we index the value in a “multi_field” type in Elasticsearch so that we can use either the analyzed or non-analyzed value. Our framework is smart enough to know when to use each.
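As a rough illustration of that decision logic, the sketch below turns the annotation settings into a field mapping. It is not the framework itself (which is Java and annotation-driven): the function names, the "raw" sub-field name, the analyzer names and the handling of a property that is neither filterable nor searchable are all assumptions, and the mapping options shown are those of Elasticsearch releases of this era.

```python
def analyzed_field(field_type, index_analyzer=None, search_analyzer=None):
    """Mapping for a value that is analyzed for free-text search."""
    field = {"type": field_type, "index": "analyzed"}
    if index_analyzer:
        field["index_analyzer"] = index_analyzer
    if search_analyzer:
        field["search_analyzer"] = search_analyzer
    return field


def property_mapping(name, field_type="string", filterable=False,
                     free_text_searchable=False,
                     index_analyzer=None, search_analyzer=None):
    """Choose a field mapping from the annotation-style settings."""
    analyzed = analyzed_field(field_type, index_analyzer, search_analyzer)
    # Filters need the exact value, so this variant is not analyzed.
    raw = {"type": field_type, "index": "not_analyzed"}

    if filterable and free_text_searchable:
        # Index both variants; the default sub-field carries the property name.
        return {"type": "multi_field", "fields": {name: analyzed, "raw": raw}}
    if filterable:
        return raw
    if free_text_searchable:
        return analyzed
    # Assumption: a property that is neither is kept but not indexed.
    return {"type": field_type, "index": "no"}


logical_name = property_mapping(
    "logicalName",
    filterable=True,
    free_text_searchable=True,
    index_analyzer="filename_index",    # hypothetical analyzer name
    search_analyzer="filename_search",  # hypothetical analyzer name
)
```

With a mapping like this, a filter would target logicalName.raw while a match query would target logicalName.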

Parent/child relationships

The other Elasticsearch modelling feature we use heavily is Parent/Child relationships. We needed to support the ability to manage and distribute a single document across many different distribution channels (Website, iPad, social media, etc.). By leveraging the Parent/Child relationships, we can index a single source document which represents the canonical binary asset and an “index card” that represents a pointer back to the source document with additional descriptor metadata. We can then create additional index card objects in each channel of distribution, all pointing back to the same source document. The index card becomes the parent document with the source document becoming the child document.
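At the Elasticsearch level, a parent/child relationship is declared in the child type's mapping, and each child is indexed with the id of its parent. The sketch below shows both pieces under assumed names: the type names (SourceDocument, IndexCard), the index name and the ids are hypothetical, and the helpers only build the mapping body and request path rather than calling Elasticsearch.

```python
from urllib.parse import quote


def child_type_mapping(child_type, parent_type):
    """Mapping body declaring which type holds the parents of child_type."""
    return {child_type: {"_parent": {"type": parent_type}}}


def child_index_path(index, child_type, doc_id, parent_id):
    """Path for indexing a child document.

    The parent id travels with the index request so the child is stored
    on the same shard as its parent.
    """
    return "/{}/{}/{}?parent={}".format(
        quote(index), quote(child_type), quote(doc_id), quote(parent_id)
    )


mapping = child_type_mapping("SourceDocument", "IndexCard")
path = child_index_path("pepper", "SourceDocument", "doc-1", "card-ipad-7")
print(mapping)
print(path)
```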

Our framework also extends this idea with another annotation on the child source document:

So why would we want to use a parent/child relationship in this way? Among other things, it allows us to write really interesting queries to fetch all content for a distribution channel while also filtering on values in the source document. Let’s look at two examples.

Our customers store product marketing and catalog data within Pepper. As we mentioned earlier, this content is leveraged across many different distribution channels. As a global marketing manager, I may need to search across the entire content ecosystem and find where a particular set of documents is being used. With Elasticsearch and our Parent/Child relationships, we can execute queries that answer this question by leveraging the has_child filter. We simply search for child documents with a specific attribute value (e.g. productCatalogId) and Elasticsearch will return all the parent documents. Parent documents represent our “index card”, which is specific to a distribution channel and contains specific usage metadata for that channel.

{
   "from":0,
   "size":2,
   "filter":{
      "has_child":{
         "type":"com.vodori.pepper.content.model.document.PepperWebpage",
         "query":{
            "term":{
               "productCatalogId":"CRX1234A"
            }
         }
      }
   }
}

We also need to be able to serve and query content by distribution channel. For example, when our iPad application, Pepper Mobile, needs to fetch all the content approved for consumption from within the iOS application, we can perform this search using the has_parent filter. In this case, we want to find child “source documents” where the parent object has a specific attribute value, such as a distributionChannel value of 5.

{
   "from":0,
   "size":2,
   "filter":{
      "has_parent":{
         "type":"com.vodori.pepper.content.model.descriptor.PostDescriptor",
         "query":{
            "term":{
               "distributionChannel":"5"
            }
         }
      }
   }
}

What’s next for Pepper’s usage of Elasticsearch?

We have a number of new features on our roadmap that will be based on Elasticsearch. A few examples include:

Index Aliasing: This will allow us to rebuild our Elasticsearch index while the existing index remains in service. Once complete, we can simply update the alias so that Pepper points to the newly rebuilt index. Zero downtime.
Updating Mappings API: Elasticsearch provides capabilities to update its mappings on the fly. When we release new versions of Pepper, it will be smart enough to inspect the current mappings and apply an update to the index based on changes in the product version requirements.
Distributed Percolator: We are very excited about this 1.0 feature because it will allow us to build a mechanism for our customers to subscribe to content updates in the system and be notified of any kind of relevant changes or compliance issues that they specify. It allows our products to be proactive about what’s going on in the system based on arbitrary business criteria defined by our customers. Very exciting.
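The index aliasing item above comes down to a single request to the aliases endpoint that removes the alias from the old index and adds it to the new one in one step. A minimal sketch of that request body follows; the alias and index names are hypothetical, and the helper only builds the body that would be sent in a POST to /_aliases.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints an alias in one request."""
    return {
        "actions": [
            # Both actions are submitted together, so readers of the alias
            # never see a moment where it points at no index.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap = alias_swap_actions("pepper", "pepper_v1", "pepper_v2")
print(json.dumps(swap, indent=2))
```

The application keeps querying the alias name throughout, so the rebuilt index can be swapped in without a configuration change.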

Wrapping up

It has been really exciting to see the Elasticsearch product and team grow over the last two years. The Elasticsearch community has been fantastic and really reinforces the fact that we made the right decision in choosing Elasticsearch. We think that this technology will continue its growth into new industries and applications as more companies realize the power and flexibility of Elasticsearch. We are excited to be a part of that growth.

Many thanks to Grant Gochnauer for sharing their experiences with us. We are always excited to share stories from the community about how they are using Elasticsearch, so please reach out to me any time you’d like to share your story.