John-Henry is a product manager with Search Technologies, an IT service firm specializing in the design, implementation, and management of search and big data applications. Search Technologies has developed software, such as Aspire, that has helped make Elastic projects successful.
Today, I am excited to announce that Search Technologies, an Elastic partner, has released Aspire for Elasticsearch. Aspire is a content access and processing framework specifically designed for unstructured data. It enables content from a variety of repositories to be acquired, cleaned, normalized, and enriched as part of Elasticsearch implementations. I encourage you to download Aspire for Elasticsearch; it is free for a 45-day trial.
Connectors and Security
Let’s talk a little bit about how Aspire is being used in conjunction with Elasticsearch. A large management consulting firm needed to index web content, several large databases, SharePoint 2010, and SharePoint 2013 in Elasticsearch. Elastic offers Logstash, which focuses on structured log and event data rather than unstructured data such as office documents and web content. Aspire for Elasticsearch comes pre-packaged with connectors for file systems, relational databases (RDB), and web content. This enables users to start indexing that content immediately and to add connectors for nearly 20 content stores, such as Box.com, Documentum, and Salesforce, as needed. And since Aspire uses a modular framework, these connectors can be licensed and plugged in at any time.
This management consulting firm needed to extract metadata and content from SharePoint and also to retrieve access control lists (ACLs) so that document-level security is maintained throughout the Elasticsearch application. Aspire SharePoint connectors support this requirement. Both Aspire and Shield address security concerns, but it is important to understand the different and complementary roles they play. Shield provides the foundational security needed to run Elasticsearch in production: it protects your cluster with a username and password and provides more advanced features such as encryption and role-based access controls. Aspire provides support for document-level security, which determines the results a user has the right to see for a query, based on the source repositories' ACLs.
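To make the idea of document-level security concrete, here is a minimal sketch of how ACLs captured at index time could be applied at query time. It is not Aspire's or Shield's actual mechanism. It assumes each indexed document carries a hypothetical `acl_allow` field listing the users and groups allowed to read it, a hypothetical `content` field holding the text, and a version of Elasticsearch whose query DSL supports the `bool` query with a `filter` clause.

```python
import json


def build_secured_query(user_text, user_principals, acl_field="acl_allow"):
    """Wrap a full-text query in an ACL filter.

    `acl_field` and `content` are hypothetical field names chosen for
    this illustration; a real index would use whatever fields the
    connector populated.
    """
    return {
        "query": {
            "bool": {
                # Relevance-scored full-text match on the document body.
                "must": {"match": {"content": user_text}},
                # Non-scoring filter: keep only documents whose ACL field
                # contains at least one of the user's principals.
                "filter": {"terms": {acl_field: sorted(user_principals)}},
            }
        }
    }


# Principals (user and group identifiers) are made up for the example.
query = build_secured_query(
    "quarterly forecast",
    {"user:jsmith", "group:consulting"},
)
print(json.dumps(query, indent=2))
```

The resulting JSON body could be sent to an index's `_search` endpoint. Resolving a user's group memberships from the source repository is the harder part in practice, and it is left out of this sketch.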
The need for content processing
By its nature, unstructured content is prone to inconsistency: incorrect or missing metadata, poor granularity, extraneous content, and erratic term usage. Content processing prior to indexing is critical to the success of search and analytics applications that use unstructured data. For example, a large recruiting customer’s Elasticsearch solution required extensive content processing. Candidate resumes used inconsistent date formats and inconsistent upper and lower case, causing poor search results.
To solve this challenge, Aspire processes the content to normalize the date formats and case usage. In some cases, multiple resumes arrive in a single XML block and need to be separated into individual documents. Aspire identifies the individual resumes and captures relevant entities (locations, titles, company names, etc.) based on rules (patterns, capitalization, etc.) to create metadata for later analysis in Elasticsearch. An innovative use of content processing in this solution is Aspire’s integration with Hadoop to support vector creation for a document matching feature.
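As a rough illustration of this kind of processing, the sketch below splits an XML block into one document per resume and normalizes dates and letter case. It uses only the Python standard library and is not Aspire's implementation. The XML layout (`resumes`, `resume`, `name`, `title`, `available`), the output field names, and the list of accepted date formats are all assumptions made for the example, and month names are assumed to be in English.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Date layouts we assume may appear in the source; extend as needed.
# Note that "03/04/2015" is read as month/day here, which is an assumption.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def split_resumes(xml_block):
    """Turn one XML block into a list of per-resume documents."""
    root = ET.fromstring(xml_block)
    docs = []
    for resume in root.iter("resume"):
        docs.append({
            # Title-case names and lower-case job titles so that
            # "JANE DOE" and "jane doe" index identically.
            "name": (resume.findtext("name") or "").strip().title(),
            "title": (resume.findtext("title") or "").strip().lower(),
            "available_from": normalize_date(resume.findtext("available") or ""),
        })
    return docs


# Made-up sample input with mixed case and two different date formats.
SAMPLE = """<resumes>
  <resume><name>JANE DOE</name><title>Senior ENGINEER</title><available>03/15/2015</available></resume>
  <resume><name>john roe</name><title>data analyst</title><available>15 March 2015</available></resume>
</resumes>"""

for doc in split_resumes(SAMPLE):
    print(doc)
```

Each resulting dictionary could then be indexed as its own Elasticsearch document. Dates that match none of the listed formats come back as `None` rather than being guessed, so they can be flagged for review.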
Aspire Enterprise for Elasticsearch, just like Marvel and Shield, is initially provided as a time-restricted “try and buy” and can easily be converted to full licensing at any time. This is a full-featured offering that includes:
- CIFS, RDB snapshot, Heritrix connectors
- Support for Elasticsearch for Hadoop
- Complete library of content processing components
- Ability to expand role-based security metadata
For more details on functionality or to download, go to Aspire for Elasticsearch.