Apache Spark support

	Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs.
	-- Spark website

Spark provides fast iterative/functional-like capabilities over large data sets, typically by caching data in memory. As opposed to the rest of the libraries mentioned in this documentation, Apache Spark is a computing framework that is not tied to Map/Reduce itself; however, it does integrate with Hadoop, mainly through HDFS. elasticsearch-hadoop allows Elasticsearch to be used in Spark in two ways: through the dedicated support available since 2.1, or through the Map/Reduce bridge available since 2.0. Spark 2.0 is supported in elasticsearch-hadoop since version 5.0.

	Spark Scala imports
	elasticsearch-hadoop Scala imports
	start Spark through its Scala API
	`makeRDD` creates an ad-hoc `RDD` based on the collection specified; any other `RDD` (in Java or Scala) can be passed in
	index the content (namely the two documents (numbers and airports)) in Elasticsearch under `spark/docs`

	`EsSpark` import
	Define a case class named `Trip`
	Create an `RDD` around the `Trip` instances
	Index the `RDD` explicitly through `EsSpark`

	Spark Java imports
	elasticsearch-hadoop Java imports
	start Spark through its Java API
	to simplify the example, use Guava (a dependency of Spark) `Immutable*` methods for simple `Map` and `List` creation
	create a simple `RDD` over the two collections; any other `RDD` (in Java or Scala) can be passed in
	index the content (namely the two documents (numbers and airports)) in Elasticsearch under `spark/docs`

	statically import `JavaEsSpark`
	define an `RDD` containing `TripBean` instances (`TripBean` is a `JavaBean`)
	call `saveToEs` method without having to type `JavaEsSpark` again

	example of an entry within the `RDD` - the JSON is written as is, without any transformation; it should not contain line-break characters such as `\n` or `\r\n`
	index the JSON data through the dedicated `saveJsonToEs` method

	example of an entry within the `RDD` - the JSON is written as is, without any transformation; it should not contain line-break characters such as `\n` or `\r\n`
	notice the `RDD<String>` signature
	index the JSON data through the dedicated `saveJsonToEs` method
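
The line-break restriction above can be checked before the JSON strings are handed to `saveJsonToEs`. The following is a minimal sketch in plain Java; the class and method names (`JsonLineCheck`, `isSingleLine`, `flatten`) are hypothetical and not part of elasticsearch-hadoop. It relies on the fact that valid JSON cannot carry a raw line break inside a string value, so any literal `\n` or `\r` is insignificant whitespace between tokens and can be dropped.

```java
/**
 * Illustrative helper only (not part of elasticsearch-hadoop): checks JSON
 * documents for raw line breaks before they are written as is.
 */
final class JsonLineCheck {

    /** True when the document contains neither '\n' nor '\r'. */
    static boolean isSingleLine(String json) {
        return json.indexOf('\n') < 0 && json.indexOf('\r') < 0;
    }

    /**
     * Removes raw line breaks. Assumes the input is valid JSON, where a raw
     * line break can only be whitespace between tokens, never string content.
     */
    static String flatten(String json) {
        return json.replace("\r", "").replace("\n", "");
    }

    public static void main(String[] args) {
        String pretty = "{\n  \"reason\": \"business\",\r\n  \"airport\": \"SFO\"\n}";
        System.out.println(isSingleLine(pretty));          // the pretty-printed form has line breaks
        System.out.println(isSingleLine(flatten(pretty))); // the flattened form does not
    }
}
```

Flattening is only one option; rejecting offending entries with `isSingleLine` may be preferable when the input is not guaranteed to be valid JSON.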

	Document key used for splitting the data. Any field can be declared (but make sure it is available in all documents)
	Save each object based on its resource pattern, in this example based on `media_type`
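
To make the idea of a resource pattern concrete, here is a small plain-Java model of how a `{field}` placeholder could be resolved per document. It is an illustration only, not the connector's implementation: the class `ResourcePatternSketch`, its `resolve` method, and the pattern string `my-collection-{media_type}/doc` are assumptions chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative model only: resolves {field} placeholders in a resource pattern. */
final class ResourcePatternSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^}]+)\\}");

    /** Replaces each {field} with the value of that field in the document. */
    static String resolve(String pattern, Map<String, ?> document) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String field = m.group(1);
            Object value = document.get(field);
            if (value == null) {
                // Mirrors the advice above: the field must be available in all documents.
                throw new IllegalArgumentException("Missing field '" + field + "'");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> game = new LinkedHashMap<>();
        game.put("media_type", "game");
        game.put("title", "FF VI");
        // prints "my-collection-game/doc"
        System.out.println(resolve("my-collection-{media_type}/doc", game));
    }
}
```

In this model a document whose `media_type` is `book` would land in `my-collection-book/doc`, while a document lacking the field fails fast instead of being routed to an unexpected resource.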

	`airportsRDD` is a key-value pair `RDD`; it is created from a `Seq` of `tuple`s
	The key of each tuple within the `Seq` represents the id of its associated value/document; in other words, document `otp` has id `1`, `muc` `2` and `sfo` `3`
	Since `airportsRDD` is a pair `RDD`, it has the `saveToEsWithMeta` method available. This tells elasticsearch-hadoop to pay special attention to the `RDD` keys and use them as metadata, in this case as document ids. Had `saveToEs` been used instead, elasticsearch-hadoop would have treated the whole `RDD` tuple, that is both the key and the value, as part of the document.

	Import the `Metadata` enum
	The metadata used for `otp` document. In this case, `ID` with a value of 1 and `TTL` with a value of `3h`
	The metadata used for `muc` document. In this case, `ID` with a value of 2 and `VERSION` with a value of `23`
	The metadata used for `sfo` document. In this case, `ID` with a value of 3
	The metadata and the documents are assembled into a pair `RDD`
	The `RDD` is saved accordingly using the `saveToEsWithMeta` method

	Create a `JavaPairRDD` by using the Scala `Tuple2` class wrapped around the document id and the document itself
	Tuple for the first document wrapped around the id (`1`) and the doc (`otp`) itself
	Tuple for the second document wrapped around the id (`2`) and `jfk`
	The `JavaPairRDD` is saved accordingly, using the keys as ids and the values as documents

Scala type	Elasticsearch type
`None`	`null`
`Unit`	`null`
`Nil`	empty `array`
`Some[T]`	`T` according to the table
`Map`	`object`
`Traversable`	`array`
case class	`object` (see `Map`)
`Product`	`array`

Java type	Elasticsearch type
`null`	`null`
`String`	`string`
`Boolean`	`boolean`
`Byte`	`byte`
`Short`	`short`
`Integer`	`int`
`Long`	`long`
`Double`	`double`
`Float`	`float`
`Number`	`float` or `double` (depending on size)
`java.util.Calendar`	`date` (`string` format)
`java.util.Date`	`date` (`string` format)
`java.sql.Timestamp`	`date` (`string` format)
`byte[]`	`string` (BASE64)
`Object[]`	`array`
`Iterable`	`array`
`Map`	`object`
Java Bean	`object` (see `Map`)
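
The Java table can be read as a lookup from a runtime value to an Elasticsearch type name. The sketch below encodes it in plain Java purely as a reading aid; `EsTypeTableSketch` and `esTypeOf` are hypothetical names, and treating every unrecognised object as a Java Bean is an assumption of this example rather than documented connector behaviour.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.Map;

/** Illustrative lookup that mirrors the Java type table; not the connector's code. */
final class EsTypeTableSketch {

    static String esTypeOf(Object value) {
        if (value == null) return "null";
        if (value instanceof String) return "string";
        if (value instanceof Boolean) return "boolean";
        // Specific numeric wrappers are tested before the generic Number row.
        if (value instanceof Byte) return "byte";
        if (value instanceof Short) return "short";
        if (value instanceof Integer) return "int";
        if (value instanceof Long) return "long";
        if (value instanceof Double) return "double";
        if (value instanceof Float) return "float";
        if (value instanceof Number) return "float or double";
        // Timestamp values extend java.util.Date, so the Date test covers them too.
        if (value instanceof Calendar || value instanceof Date) return "date";
        if (value instanceof byte[]) return "string (BASE64)";
        if (value instanceof Object[] || value instanceof Iterable) return "array";
        if (value instanceof Map) return "object";
        // Assumption of this sketch: anything else is handled like a Java Bean.
        return "object";
    }

    public static void main(String[] args) {
        System.out.println(esTypeOf(42L));
        System.out.println(esTypeOf(new byte[] {1, 2, 3}));
        System.out.println(esTypeOf(new Date()));
    }
}
```

The order of the checks matters: placing the `Number` test first would hide the more specific `Integer`, `Long` and similar rows.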

	Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
	-- Spark website

	`Metadata` `enum` describing the document metadata that can be declared
	static import for the `enum` to refer to its values in short format (`ID`, `TTL`, etc…)
	Metadata for `otp` document
	Metadata for `sfo` document
	Tuple between `otp` (as the value) and its metadata (as the key)
	Tuple associating `sfo` and its metadata
	`saveToEsWithMeta` invoked over the `JavaPairRDD` containing documents and their respective metadata

	statically import `JavaEsSpark` class
	create an `RDD` streaming all the documents starting with `me` from index `radio/artists`. Note the method does not have to be fully qualified due to the static import
	return only values of the `PairRDD` - hence why the result is of type `JavaRDD` and not `JavaPairRDD`

	Spark and Spark Streaming Scala imports
	elasticsearch-hadoop Spark Streaming imports
	start Spark through its Scala API
	start SparkStreaming context by passing it the SparkContext. The microbatches will be processed every second.
	`makeRDD` creates an ad-hoc `RDD` based on the collection specified; any other `RDD` (in Java or Scala) can be passed in. Create a queue of `RDD`s to signify the microbatches to perform.
	Create a `DStream` out of the `RDD`s and index the content (namely the two documents, numbers and airports) in Elasticsearch under `spark/docs`
	Start the Spark Streaming job and wait for it to eventually finish.

	`EsSparkStreaming` import
	Define a case class named `Trip`
	Create a `DStream` around the `RDD` of `Trip` instances
	Configure the `DStream` to be indexed explicitly through `EsSparkStreaming`
	Start the streaming process

	Spark and Spark Streaming Java imports
	elasticsearch-hadoop Java imports
	start Spark and Spark Streaming through its Java API. The microbatches will be processed every second.
	to simplify the example, use Guava (a dependency of Spark) `Immutable*` methods for simple `Map` and `List` creation
	create a simple `DStream` over the microbatch; any other `RDD`s (in Java or Scala) can be passed in
	index the content (namely the two documents (numbers and airports)) in Elasticsearch under `spark/docs`
	execute the streaming job.

	statically import `JavaEsSparkStreaming`
	define a `DStream` containing `TripBean` instances (`TripBean` is a `JavaBean`)
	call `saveToEs` method without having to type `JavaEsSparkStreaming` again
	run that Streaming job

	example of an entry within the `DStream` - the JSON is written as is, without any transformation
	configure the stream to index the JSON data through the dedicated `saveJsonToEs` method
	start the streaming job

	example of an entry within the `DStream` - the JSON is written as is, without any transformation
	creating an `RDD`, placing it into a queue, and creating a `DStream` out of the queued `RDD`s, treating each as a microbatch.
	notice the `JavaDStream<String>` signature
	configure stream to index the JSON data through the dedicated `saveJsonToEs` method
	launch stream job

	`airportsRDD` is a key-value pair `RDD`; it is created from a `Seq` of `tuple`s
	The key of each tuple within the `Seq` represents the id of its associated value/document; in other words, document `otp` has id `1`, `muc` `2` and `sfo` `3`
	We construct a `DStream` which inherits the type signature of the `RDD`
	Since the resulting `DStream` is a pair `DStream`, it has the `saveToEsWithMeta` method available. This tells elasticsearch-hadoop to pay special attention to the `DStream` keys and use them as metadata, in this case as document ids. Had `saveToEs` been used instead, elasticsearch-hadoop would have treated the whole `DStream` tuple, that is both the key and the value, as part of the document.

	Create a regular `JavaRDD` of Scala `Tuple2`s wrapped around the document id and the document itself
	Tuple for the first document wrapped around the id (`1`) and the doc (`otp`) itself
	Tuple for the second document wrapped around the id (`2`) and `jfk`
	Assemble a regular `JavaDStream` out of the tuple `RDD`
	Transform the `JavaDStream` into a `JavaPairDStream` by passing our `Tuple2` identity function to the `mapToPair` method. This will allow the type to be converted to a `JavaPairDStream`. This function could be replaced by anything in your job that would extract both the id and the document to be indexed from a single entry.
	The `JavaPairDStream` is configured to index the data accordingly, using the keys as ids and the values as documents

Installation

Native RDD support

Configuration

Writing data to Elasticsearch

Scala

Java

Writing existing JSON to Elasticsearch

Scala

Java

Writing to dynamic/multi-resources

Scala

Java

Handling document metadata

Scala

Java

Reading data from Elasticsearch

Scala

Java

Reading data in JSON format

Type conversion

Spark Streaming support

Writing DStream to Elasticsearch

Scala

Java

Writing Existing JSON to Elasticsearch

Scala

Java

Writing to dynamic/multi-resources

Scala

Java

Handling document metadata

Scala

Java

Spark Streaming Type Conversion

Spark SQL support

Supported Spark SQL versions

Writing DataFrame (Spark SQL 1.3+) to Elasticsearch

Scala

Java

Writing existing JSON to Elasticsearch

Using pure SQL to read from Elasticsearch

Data Sources in Spark SQL

Push-Down operations

Data Sources as tables

Reading DataFrames (Spark SQL 1.3) from Elasticsearch

Scala

Java

Spark SQL Type conversion

Spark Structured Streaming support

Supported Spark Structured Streaming versions

Writing Streaming Datasets (Spark SQL 2.0+) to Elasticsearch

Scala

Java

Writing existing JSON to Elasticsearch

Sink commit log in Spark Structured Streaming

Spark Structured Streaming Type conversion

Using the Map/Reduce layer

Configuration

Reading data from Elasticsearch

Old (org.apache.hadoop.mapred) API

New (org.apache.hadoop.mapreduce) API

Using the connector from PySpark

	Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
	-- Spark website

	Spark SQL package import
	elasticsearch-hadoop Spark package import
	Read a text file as a normal `RDD` and map it to a `DataFrame` (using the `Person` case class)
	Index the resulting `DataFrame` to Elasticsearch through the `saveToEs` method

	Spark SQL Java imports
	elasticsearch-hadoop Spark SQL Java imports
	index the `DataFrame` in Elasticsearch under `spark/people`

	statically import `JavaEsSparkSQL`
	call `saveToEs` method without having to type `JavaEsSparkSQL` again

	`SQLContext` experimental `load` method for arbitrary data sources
	path or resource to load - in this case the index/type in Elasticsearch
	the data source provider - `org.elasticsearch.spark.sql`

	`SQLContext` experimental `read` method for arbitrary data sources
	the data source provider - `org.elasticsearch.spark.sql`
	path or resource to load - in this case the index/type in Elasticsearch

Name	Default value	Description
`path`	required	Elasticsearch index/type
`pushdown`	`true`	Whether to translate (push-down) Spark SQL into Elasticsearch Query DSL
`strict`	`false`	Whether to use exact (not analyzed) matching or not (analyzed)
`double.filtering`	`true`	Whether to tell Spark to apply its own filtering on top of the filters pushed down (usable in Spark 1.6 or higher)
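
As a reading aid for the table, the sketch below merges user-supplied options over the listed defaults in plain Java. The helper `DataSourceOptionsSketch.withDefaults` is hypothetical; only the option names and default values come from the table, and the index name `spark/people` is used purely as sample input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper: applies the defaults from the options table. */
final class DataSourceOptionsSketch {

    static Map<String, String> withDefaults(Map<String, String> userOptions) {
        if (!userOptions.containsKey("path")) {
            // "path" is the only option marked as required in the table.
            throw new IllegalArgumentException("'path' (Elasticsearch index/type) is required");
        }
        Map<String, String> merged = new LinkedHashMap<>();
        merged.put("pushdown", "true");
        merged.put("strict", "false");
        merged.put("double.filtering", "true");
        merged.putAll(userOptions); // explicit settings win over the defaults
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("path", "spark/people");
        user.put("strict", "true");
        System.out.println(withDefaults(user));
    }
}
```

A map of this shape is the kind of value that is passed as options when defining or loading the data source.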

	`pushdown` option - specific to Spark data sources
	`es.nodes` configuration option
	pass the options when defining/loading the source

	Spark’s temporary table name
	`USING` clause identifying the data source provider, in this case `org.elasticsearch.spark.sql`
	elasticsearch-hadoop configuration options, the mandatory one being `resource`. The `es.` prefix is left out of the option names (the connector adds it) because the SQL parser does not accept `.` in them

SQL syntax	ES 1.x/2.x syntax	ES 5.x syntax
= null , is_null	missing	must_not.exists
= (strict)	term	term
= (not strict)	match	match
> , < , >= , <=	range	range
is_not_null	exists	exists
in (strict)	terms	terms
in (not strict)	or.filters	bool.should
and	and.filters	bool.filter
or	or.filters	bool.should [bool.filter]
not	not.filter	bool.must_not
StringStartsWith	wildcard(arg*)	wildcard(arg*)
StringEndsWith	wildcard(*arg)	wildcard(*arg)
StringContains	wildcard(*arg*)	wildcard(*arg*)
EqualNullSafe (strict)	term	term
EqualNullSafe (not strict)	match	match
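
A few rows of the ES 5.x column can be sketched as a translation function in plain Java. This is a simplified model for illustration, not the connector's translator: `PushDownSketch.translate` is a hypothetical name, only a handful of operators are covered, values are not JSON-escaped, and rendering `StringContains` as `*arg*` is an assumption of this example.

```java
/** Illustrative model of a few push-down rows (ES 5.x column); not the connector's code. */
final class PushDownSketch {

    /**
     * @param op     operator name as written in the table, e.g. "=" or "StringStartsWith"
     * @param field  document field the filter applies to
     * @param arg    filter argument (ignored for null checks)
     * @param strict whether exact (not analyzed) matching is requested
     */
    static String translate(String op, String field, String arg, boolean strict) {
        switch (op) {
            case "=":
                return leaf(strict ? "term" : "match", field, arg);
            case "is_null":
                return "{\"bool\":{\"must_not\":" + exists(field) + "}}";
            case "is_not_null":
                return exists(field);
            case "StringStartsWith":
                return leaf("wildcard", field, arg + "*");
            case "StringEndsWith":
                return leaf("wildcard", field, "*" + arg);
            case "StringContains":
                return leaf("wildcard", field, "*" + arg + "*");
            default:
                throw new IllegalArgumentException("Not covered by this sketch: " + op);
        }
    }

    // Builds {"<query>":{"<field>":"<value>"}} without any JSON escaping.
    private static String leaf(String query, String field, String value) {
        return "{\"" + query + "\":{\"" + field + "\":\"" + value + "\"}}";
    }

    private static String exists(String field) {
        return "{\"exists\":{\"field\":\"" + field + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(translate("=", "name", "Costinel", true));
        System.out.println(translate("=", "name", "Costinel", false));
        System.out.println(translate("StringStartsWith", "name", "Co", false));
    }
}
```

The `strict` flag only changes the rows that the table lists in both strict and non-strict variants; the null checks and wildcard rows are unaffected by it.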