18 November 2015 User Stories

Elasticsearch plus StreamSets for reliable data ingestion

By Arvind Prabhakar

StreamSets Data Collector is open source software that lets you easily build continuous data ingestion pipelines for Elasticsearch. By being resistant to "data drift", StreamSets minimizes ingest-related data loss and helps ensure optimized indexes so that Elasticsearch and Kibana users can perform real-time analysis with confidence.

Elasticsearch provides a powerful platform for real-time analytics at scale. Reliable and high-quality data ingestion is a critical component of any analytics pipeline. The recently launched StreamSets Data Collector provides an enhanced data ingestion process to ensure that data streaming into Elasticsearch is pristine, and remains so on a continuous basis.

The StreamSets Data Collector provides an open source ingestion infrastructure that helps you build continuously curated ingestion pipelines, and improves upon legacy ETL and hand-coded solutions in three important ways:

  • it automatically adapts to changes in schema, semantics, and infrastructure;
  • it prepares and cleanses data in-stream; and
  • it provides early warning and graceful error handling to keep pipelines operational and reliable.

StreamSets customers ingest data from a variety of sources, including third-party data feeds, internal systems, and physical sensors. The more they sanitize their data while in motion, the greater their confidence in the analysis they perform using Elasticsearch and Kibana. And the less downtime and data loss in their pipelines, the more complete and accurate that analysis is.

Data Drift Breaks and Corrodes Pipelines

A chief culprit in the battle for data quality and pipeline reliability is data drift, which is the accumulation of numerous unanticipated changes that occur in data streams. Data drift can break ingest pipelines or corrupt data.

In the brittle world of schema-centric ETL, opaque pipelines can break due to the smallest upstream alteration, such as a reordering or renaming of fields or a minor change in data type formatting. Each upstream change risks stopping the ingest process in its tracks or, worse, causing silent data corruption that goes unnoticed for months. When ETL pipelines break, doing forensics and diagnosing their logic is difficult, if not downright impossible, for large distributed pipelines. The entire pipeline is a black box requiring substantial effort to debug and bring back online.
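To make the contrast concrete, here is a hypothetical sketch (plain Python, not StreamSets code) comparing a position-based parser, which mis-parses records the moment upstream fields are reordered, with a name-based parser that tolerates that kind of drift:

```python
import csv
import io

# Brittle, position-based parsing: assumes "amount" is always the third column.
def parse_positional(line):
    fields = line.strip().split(",")
    return {"id": fields[0], "user": fields[1], "amount": float(fields[2])}

# Drift-tolerant parsing: reads the header and looks fields up by name,
# so reordering or adding columns upstream does not corrupt the result.
def parse_by_name(raw):
    reader = csv.DictReader(io.StringIO(raw))
    return [
        {"id": row["id"], "user": row["user"], "amount": float(row["amount"])}
        for row in reader
    ]

# Upstream reorders the columns: the positional parser would now read
# "alice" where it expects an amount, while the name-based parser still
# returns correct records.
original = "id,user,amount\n1,alice,9.99\n"
drifted  = "id,amount,user\n1,9.99,alice\n"

print(parse_by_name(original))  # [{'id': '1', 'user': 'alice', 'amount': 9.99}]
print(parse_by_name(drifted))   # same records, despite the reordering
```

This is only an illustration of the failure mode; StreamSets applies the same principle (intent-based, name-aware record handling rather than positional schema binding) across entire pipelines.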

Data drift also creates an insidious second consequence: it acts as a silent killer of data quality. When semantic changes are missed or ignored, as is common, data quality erodes over time through the accumulation of dropped records, null values, and shifting meanings of values.

The operational impact of data drift and the resulting corrosion is that the results of real-time analysis become unreliable or, due to frequent outages, the analysis becomes impossible to perform. The business impact is that bad data leads to bad decisions and entire areas of analytic potential are left on the shelf.

StreamSets Makes Continuous Big Data Ingest Simple

StreamSets was designed from the ground up to let customers easily build pipelines that stay reliable in the face of data drift. You can use a drag-and-drop interface to connect sources to Elasticsearch and to wire in pre-built data preparation functions such as field parsing or PII masking. You can also route data based on conditions, such as sending records with unexpected values to an error queue.

As an example, the screenshot below shows a pipeline for analyzing payment transactions using Elasticsearch as a destination (1).  The pipeline includes data routing for credit card vs. cash transactions (2), masking of credit card data (3) and routing of suspect location data into a Kafka message queue (4).  All of this was set up, tested, activated, and monitored without writing any code.


In addition to its user interface, StreamSets provides a wide array of APIs for extensibility and customization, making it not a special purpose tool but rather a core piece of modern data infrastructure.  Early uses for it have been wide and varied, including:

  • A price prediction service needing to analyze data from hundreds of external data sources.
  • A bank wanting to create a high-quality and reliable customer “event firehose” for all their customer interaction channels.
  • A global “cloud of clouds” operator meeting QoS targets by improving system monitoring.
  • Social data analysis across data streaming from hundreds of application nodes.

Elasticsearch and StreamSets are naturally complementary and critical components in a real-time analytics architecture.  Customers can use StreamSets to connect a wide variety of sources to both Elasticsearch and Found to provide real-time search capabilities using a constantly curated data flow. Both are built for real-time in-memory performance. Both scale out gracefully through distributed architectures. And both are open source.

Arvind Prabhakar is a seasoned engineering leader, who has worked on data integration challenges for over ten years. Before co-founding StreamSets, Arvind was an early employee of Cloudera, and led teams working on integration technologies such as Flume and Sqoop. A member of the Apache Software Foundation, Arvind is heavily involved in the open-source community as the PMC Chair for Apache Flume, the first PMC Chair of Apache Sqoop, and member of various other Apache projects. Prior to Cloudera, Arvind was a software architect at Informatica, where he was responsible for architecting, designing, and implementing several core systems.