Added in 6.7.
Kerberos support for Elasticsearch for Apache Hadoop requires Elasticsearch 6.7 or greater.
Securing Hadoop means using Kerberos. Elasticsearch supports Kerberos as an authentication method. While the use of Kerberos is not required for securing Elasticsearch, it is a convenient option for those who already deploy Kerberos to secure their Hadoop clusters. This chapter aims to explain the steps needed to set up elasticsearch-hadoop to use Kerberos authentication for Elasticsearch.
Elasticsearch for Apache Hadoop communicates with Elasticsearch entirely over HTTP. In order to support Kerberos authentication over HTTP, elasticsearch-hadoop uses the Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) to negotiate which underlying authentication method to use (in this case, Kerberos) and to transmit the agreed upon credentials to the server. This authentication mechanism is performed using the HTTP Negotiate authentication standard, where a request is sent to the server and a response is received back with a payload that further advances the negotiation. Once the negotiation between the client and server is complete, the request is accepted and a successful response is returned.
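The header framing of that exchange can be sketched with the JDK alone. This is a minimal illustration of the HTTP Negotiate scheme (the server challenges with WWW-Authenticate: Negotiate, the client answers with Authorization: Negotiate followed by a base64 token), not elasticsearch-hadoop's implementation. The class and method names are hypothetical, and the token bytes are placeholders rather than real GSS-API output.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

/** Sketch of the HTTP Negotiate header framing used by SPNEGO. Names are hypothetical. */
final class NegotiateHeaders {
    private static final String SCHEME = "Negotiate";

    /** Value for the client's Authorization header: "Negotiate <base64 token>". */
    static String authorizationValue(byte[] gssToken) {
        return SCHEME + " " + Base64.getEncoder().encodeToString(gssToken);
    }

    /**
     * Extracts the server's token from a WWW-Authenticate header value.
     * Returns empty for a bare "Negotiate" challenge or for any other scheme.
     */
    static Optional<byte[]> serverToken(String wwwAuthenticate) {
        if (wwwAuthenticate == null) {
            return Optional.empty();
        }
        String value = wwwAuthenticate.trim();
        if (!value.regionMatches(true, 0, SCHEME, 0, SCHEME.length())) {
            return Optional.empty(); // some other authentication scheme
        }
        if (value.length() == SCHEME.length()) {
            return Optional.empty(); // bare challenge, no token yet
        }
        if (!Character.isWhitespace(value.charAt(SCHEME.length()))) {
            return Optional.empty(); // e.g. "NegotiateX", not this scheme
        }
        String encoded = value.substring(SCHEME.length()).trim();
        return Optional.of(Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) {
        // Placeholder bytes only; a real client would get the token from GSS-API.
        byte[] placeholder = "not-a-real-gss-token".getBytes(StandardCharsets.UTF_8);
        System.out.println("Authorization: " + authorizationValue(placeholder));
    }
}
```

In a real negotiation the client repeats the request with each new token until the server accepts it, and the token carried on the final response is what mutual authentication verifies.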
Elasticsearch for Apache Hadoop makes use of Hadoop’s user management processes: the Kerberos credentials of the current Hadoop user are used when authenticating to Elasticsearch. This means that Kerberos authentication in Hadoop must be enabled in order for elasticsearch-hadoop to obtain a user’s Kerberos credentials. When using an integration that does not depend on Hadoop’s runtime (e.g. Storm), additional steps may be required to ensure that a running process has Kerberos credentials available for authentication. Consult the documentation of each framework you are using for how to configure security.
This documentation assumes that you have already provisioned a Hadoop cluster with Kerberos authentication enabled (required). The general process of deploying Kerberos and securing Hadoop is beyond the scope of this documentation.
Before starting, you will need to ensure that principals for your users are provisioned in your Kerberos deployment, as well as service principals for each Elasticsearch node. To enable Kerberos authentication on Elasticsearch, it must be configured with a Kerberos realm. It is recommended that you familiarize yourself with how to configure Elasticsearch Kerberos realms so that you can make appropriate adjustments to fit your deployment. You can find more information on how they work in the Elastic Stack documentation.
Additionally, you will need to configure the API Key Realm in Elasticsearch. Hadoop and other distributed data processing frameworks only authenticate with Kerberos in the process that launches a job. Once a job has been launched, the worker processes are often cut off from the original Kerberos credentials and need some other form of authentication. Hadoop services often provide mechanisms for obtaining Delegation Tokens during job submission. These tokens are then distributed to worker processes which use the tokens to authenticate on behalf of the user running the job. Elasticsearch for Apache Hadoop obtains API Keys in order to provide tokens for worker processes to authenticate with.
The following settings are used to configure elasticsearch-hadoop to use Kerberos authentication:
Required. Similar to most Hadoop integrations, the es.security.authentication property signals which method to use in order to authenticate with Elasticsearch. By default, the value is simple, unless es.net.http.auth.user is set, in which case it will default to basic. The available options for this setting are simple for no authentication, basic for basic HTTP authentication, pki if relying on certificates, and kerberos if Kerberos authentication over SPNEGO should be used.
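The defaulting rule for es.security.authentication can be sketched as a small function. This mirrors the rule as this section describes it (an explicit value wins, otherwise basic when es.net.http.auth.user is present, otherwise simple). It is an illustration only, not the library's own code, and the class and method names are hypothetical.

```java
import java.util.Properties;

/** Sketch of how the effective authentication method could be derived. Names are hypothetical. */
final class AuthMethodDefaults {
    static String effectiveAuthentication(Properties settings) {
        // An explicitly configured method always wins.
        String explicit = settings.getProperty("es.security.authentication");
        if (explicit != null && !explicit.isBlank()) {
            return explicit.trim();
        }
        // Otherwise fall back to basic when a user name is configured, else simple.
        String user = settings.getProperty("es.net.http.auth.user");
        return (user != null && !user.isBlank()) ? "basic" : "simple";
    }

    public static void main(String[] args) {
        Properties settings = new Properties();
        settings.setProperty("es.security.authentication", "kerberos");
        System.out.println(effectiveAuthentication(settings));
    }
}
```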
Required if es.security.authentication is set to kerberos. The es.net.spnego.auth.elasticsearch.principal property details the name of the service principal that the Elasticsearch server is running as. This will usually be of the form HTTP/node.address@REALM. Since Elasticsearch is distributed and should be using a service principal per node, you can use the _HOST pattern (like so: HTTP/_HOST@REALM) to have elasticsearch-hadoop substitute the address of the node it is communicating with at runtime. Note that elasticsearch-hadoop will attempt to reverse resolve node IP addresses to hostnames in order to perform this substitution.
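As a rough illustration of the _HOST substitution, the sketch below replaces the token with a hostname and shows one way to reverse resolve an address with the JDK. It is not the library's implementation: the class and method names are hypothetical, and lowercasing the hostname is an assumption made here for consistency with how Kerberos host principals are commonly written.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

/** Sketch of _HOST substitution in a service principal pattern. Names are hypothetical. */
final class PrincipalPattern {
    private static final String HOST_TOKEN = "_HOST";

    /** Replaces _HOST in a pattern such as HTTP/_HOST@REALM with the given hostname. */
    static String substituteHost(String pattern, String hostname) {
        if (!pattern.contains(HOST_TOKEN)) {
            return pattern; // a fully specified principal is used as-is
        }
        // Assumption: hostnames are lowercased before substitution.
        return pattern.replace(HOST_TOKEN, hostname.toLowerCase(Locale.ROOT));
    }

    /**
     * Reverse resolves an IP address to a hostname using the JDK resolver.
     * The JDK returns the textual address when no name can be found.
     */
    static String reverseResolve(String ipAddress) throws UnknownHostException {
        return InetAddress.getByName(ipAddress).getCanonicalHostName();
    }

    public static void main(String[] args) {
        System.out.println(substituteHost("HTTP/_HOST@REALM", "es-node-1.example.test"));
    }
}
```

If reverse resolution returns an IP address instead of a name, the substituted principal will not match the one provisioned for the node, so working reverse DNS for the Elasticsearch nodes matters when using this pattern.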
Optional. The SPNEGO mechanism assumes that authentication may take multiple back and forth request-response cycles for a request to be fully accepted by the server. When a request is finally accepted by the server, the response contains a payload that can be verified to ensure that the server is the principal it claims to be. Setting this to true instructs elasticsearch-hadoop to perform this mutual authentication, and to fail the response if it detects invalid credentials from the server.
Before using Kerberos authentication to Elasticsearch, Kerberos authentication must be enabled in Hadoop.
Elasticsearch for Apache Hadoop only needs a few settings to configure Kerberos authentication. It is best to set these properties in your core-site.xml configuration so that they can be obtained across your entire Hadoop deployment, just like you would for turning on security options for services in Hadoop.
<configuration>
    ...
    <property>
        <name>es.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>es.net.spnego.auth.elasticsearch.principal</name>
        <value>HTTP/_HOST@REALM.NAME.HERE</value>
    </property>
    ...
</configuration>
When applications launch on a YARN cluster, they send along all of their application credentials to the Resource Manager process for them to be distributed to the containers. The Resource Manager has the ability to renew any tokens in those credentials that are about to expire and to cancel tokens once a job has completed. The tokens from Elasticsearch have a default lifespan of 7 days and they are not renewable. It is a best practice to configure YARN so that it is able to cancel those tokens at the end of a run in order to lower the risk of unauthorized use, and to lower the amount of bookkeeping Elasticsearch must perform to maintain them.
In order to configure YARN to allow it to cancel Elasticsearch tokens at the end of a run, you must add the elasticsearch-hadoop jar to the Resource Manager’s classpath. You can do that by placing the jar on the Resource Manager’s local filesystem, and setting the path to the jar in the YARN_USER_CLASSPATH environment variable. Once the jar is added, the Resource Manager will need to be restarted.
Additionally, the connection information for elasticsearch-hadoop should be present in the Hadoop configuration, core-site.xml. This is because when the Resource Manager cancels a token, it does not take the job configuration into account. Without the connection settings in the Hadoop configuration, the Resource Manager will not be able to communicate with Elasticsearch in order to cancel the token.
Here are a few common security properties that you will need in order for the Resource Manager to contact Elasticsearch to cancel tokens:
<configuration>
    ...
    <property>
        <name>es.nodes</name>
        <value>es-master-1,es-master-2,es-master-3</value>
    </property>
    <property>
        <name>es.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>es.net.spnego.auth.elasticsearch.principal</name>
        <value>HTTP/_HOST@REALM</value>
    </property>
    <property>
        <name>es.net.ssl</name>
        <value>true</value>
    </property>
    <property>
        <name>es.net.ssl.keystore.location</name>
        <value>file:///path/to/ssl/keystore</value>
    </property>
    <property>
        <name>es.net.ssl.truststore.location</name>
        <value>file:///path/to/ssl/truststore</value>
    </property>
    <property>
        <name>es.keystore.location</name>
        <value>file:///path/to/es/secure/store</value>
    </property>
    ...
</configuration>
- es.nodes: The addresses of some Elasticsearch nodes. These can be any nodes (or all of them) as long as they all belong to the same cluster.
- es.security.authentication: Authentication must be configured as kerberos.
- es.net.spnego.auth.elasticsearch.principal: The name of the Elasticsearch service principal is not required for token cancellation, although the example keeps it in core-site.xml with the other settings.
- es.net.ssl: SSL should be enabled if you are using a secured Elasticsearch deployment.
- es.net.ssl.keystore.location: Location on the local filesystem to reach the SSL Keystore.
- es.net.ssl.truststore.location: Location on the local filesystem to reach the SSL Truststore.
- es.keystore.location: Location on the local filesystem to reach the elasticsearch-hadoop secure store for secure settings.
Before launching your Map/Reduce job, you must add a delegation token for Elasticsearch to the job’s credential set. The EsMapReduceUtil utility class can be used to do this for you. Simply pass your job to it before submitting it to the cluster. Using the local Kerberos credentials, the utility will establish a connection to Elasticsearch, request an API Key, and stow the key in the job’s credential set for the worker processes to use.
- Create a new job instance.
- Pass the job to EsMapReduceUtil, which obtains job delegation tokens for Elasticsearch.
- Submit the job to the cluster.
You can obtain the job delegation tokens at any time during the configuration of the Job object, as long as your elasticsearch-hadoop specific configurations are set. It’s usually sufficient to do it right before submitting the job. You should only do this once per job since each call will wastefully obtain another API Key.
The utility is also compatible with the mapred API classes.
Using Kerberos auth on Elasticsearch is only supported using HiveServer2.
Before using Kerberos authentication to Elasticsearch in Hive, Kerberos authentication must be enabled for Hadoop. Make sure you have done all the required steps for configuring your Hadoop cluster as well as the steps for configuring your YARN services before using Kerberos authentication for Elasticsearch.
Finally, ensure that Hive Security is enabled.
Hive’s security model follows a proxy-based approach. When a client submits a query to a secured Hive server, Hive authenticates the client using Kerberos. Once Hive is sure of the client’s identity, it wraps its own identity with a proxy user. The proxy user contains the client’s simple user name, but contains no credentials. Instead, it is expected that all interactions are executed as the Hive principal impersonating the client user. This is why when configuring Hive security, one must specify in the Hadoop configuration which users Hive is allowed to impersonate:
<property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
</property>
Elasticsearch supports user impersonation, but only users from certain realm implementations can be impersonated. Most deployments of Kerberos include other identity management components like LDAP or Active Directory. In those cases, you can configure those realms in Elasticsearch to allow for user impersonation.
If you are only using Kerberos, or you are using a solution for which Elasticsearch does not support user impersonation, you must mirror your Kerberos principals to either a native realm or a file realm in Elasticsearch. When mirroring a Kerberos principal to one of these realms, set the new user’s username to just the main part of the principal name, without any realm or host information. For instance, client@REALM would just be client.
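The naming rule for mirrored users can be sketched as a small helper that keeps only the first component of a principal. This is an illustrative sketch with hypothetical class and method names. It does not handle escaped separator characters inside principal names.

```java
/** Sketch: derive the user name to mirror from a Kerberos principal. Names are hypothetical. */
final class KerberosNames {
    /** Returns the part of the principal before any host ("/...") or realm ("@...") section. */
    static String simpleName(String principal) {
        int end = principal.length();
        int at = principal.indexOf('@');
        if (at >= 0) {
            end = at; // drop the realm
        }
        int slash = principal.indexOf('/');
        if (slash >= 0 && slash < end) {
            end = slash; // drop the host or instance component
        }
        return principal.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(simpleName("client@REALM"));
    }
}
```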
You can follow this step-by-step process for mirroring users:
Create a role for your end users that will be querying Hive. In this example, we will make a simple role for accessing indices that match hive-index-*. All our Hive users will end up using this role to read, write, and update indices.
Now that the user role is created, we must map the Kerberos user principals to the role. Elasticsearch does not know the complete list of principals that are managed by Kerberos. As such, each principal that wishes to connect to Elasticsearch must be mapped to a list of roles that they will be granted after authentication.
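A role mapping request for one principal might be assembled as in the sketch below. It only builds the request and does not send it. The class name, the mapping name hive_users, the role name hive_user_role, and the host are hypothetical, and the /_security/role_mapping path is assumed to match the security API of the Elasticsearch version in use, so check the Elastic Stack documentation for your release.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

/** Sketch: build (but do not send) a role mapping request. All names are hypothetical. */
final class RoleMappingRequest {
    /** Minimal JSON string quoting, sufficient for simple names and principals. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** JSON body that grants the given roles to a user whose name matches the principal. */
    static String body(String principal, List<String> roles) {
        String roleArray = roles.stream()
                .map(RoleMappingRequest::quote)
                .collect(Collectors.joining(", ", "[", "]"));
        return "{\"roles\": " + roleArray + ", \"enabled\": true, "
                + "\"rules\": {\"field\": {\"username\": " + quote(principal) + "}}}";
    }

    /** PUT request against the assumed role mapping endpoint. */
    static HttpRequest request(String baseUrl, String mappingName,
                               String principal, List<String> roles) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_security/role_mapping/" + mappingName))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body(principal, roles)))
                .build();
    }

    public static void main(String[] args) {
        System.out.println(body("hive.user.1@REALM", List.of("hive_user_role")));
    }
}
```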
You may not have to perform this step if you are deploying LDAP or Active Directory along with Kerberos. Elasticsearch will perform user impersonation by looking up the user names in those realms as long as the simple names (e.g. hive.user.1) on the Kerberos principals match the user names LDAP or Active Directory exactly.
Mirroring the user to the native realm will allow Elasticsearch to accept authentication requests from the original principal as well as accept requests from Hive which is impersonating the user. You can create a user in the native realm like so:
- The user name is the simple name from the Kerberos principal (e.g. hive.user.1).
- Provide a password here for the user. This should ideally be a securely generated random password since this mirrored user is just for impersonation purposes.
- Set the user’s roles to be the example role.
- This is not required, but setting the original principal on the user as metadata may be helpful for your own bookkeeping.
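One way to follow the advice about a securely generated random password is sketched below, together with a request body for creating the mirrored user. The class name, the role name hive_user_role, and the metadata key principal are hypothetical. The body shape (password, roles, metadata) follows the native realm's create-user API as generally documented, which should be confirmed against your Elasticsearch version.

```java
import java.security.SecureRandom;
import java.util.Base64;

/** Sketch: random password and request body for a mirrored user. Names are hypothetical. */
final class MirroredUser {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Generates a URL-safe random password from the given number of random bytes. */
    static String randomPassword(int numBytes) {
        byte[] bytes = new byte[numBytes];
        RANDOM.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** Minimal JSON string quoting, sufficient for simple names and principals. */
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** JSON body with the password, a single role, and the original principal as metadata. */
    static String createUserBody(String password, String role, String originalPrincipal) {
        return "{\"password\": " + quote(password)
                + ", \"roles\": [" + quote(role) + "]"
                + ", \"metadata\": {\"principal\": " + quote(originalPrincipal) + "}}";
    }

    public static void main(String[] args) {
        // The mirrored user is named after the simple name of the principal.
        System.out.println("PUT /_security/user/hive.user.1");
        System.out.println(createUserBody(randomPassword(24), "hive_user_role", "hive.user.1@REALM"));
    }
}
```

Since the mirrored user exists only so that it can be impersonated, the generated password does not need to be shared with anyone.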
Once you have configured Elasticsearch with a role mapping for your Kerberos principals and native users for impersonation, you must create a role that Hive will use to impersonate those users.
Now that there are users to impersonate, and a role that can impersonate them, make sure to map the Hive principal to the proxier role, as well as any of the roles that the users it is impersonating would have. This allows the Hive principal to create and read indices, documents, or do anything else its impersonated users might be able to do. While Hive is impersonating the user, it must have these roles or else it will not be able to fully impersonate that user.
Here we set the roles to be the superset of the roles from the users we want to impersonate. In our example, the
The role that allows Hive to impersonate Hive end users.
The name of the Hive server principal to match against.
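The set of roles to map to the Hive principal can be computed as the proxier role plus the union of the roles held by every user Hive may impersonate. The sketch below shows that calculation only. The class name and all role and user names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: roles the Hive principal needs in order to impersonate its users. Names are hypothetical. */
final class HiveRoles {
    static Set<String> rolesForHivePrincipal(String proxierRole,
                                             Map<String, Set<String>> rolesByUser) {
        // TreeSet keeps the result sorted, which makes the mapping easy to review.
        Set<String> all = new TreeSet<>();
        all.add(proxierRole);
        for (Set<String> userRoles : rolesByUser.values()) {
            all.addAll(userRoles); // superset of every impersonated user's roles
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> rolesByUser = Map.of(
                "hive.user.1", Set.of("hive_user_role"),
                "hive.user.2", Set.of("hive_user_role", "reporting_role"));
        System.out.println(rolesForHivePrincipal("hive_proxier", rolesByUser));
    }
}
```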
If managing Kerberos role mappings via the APIs is not desired, they can instead be managed in a role mapping file.
Once all user accounts are configured and all previous steps for enabling Kerberos auth in Hadoop and Hive are complete, there should be no differences in creating Hive queries from before.
Before using Kerberos authentication to Elasticsearch in Pig, Kerberos authentication must be enabled for Hadoop. Make sure you have done all the required steps for configuring your Hadoop cluster as well as the steps for configuring your YARN services before using Kerberos authentication for Elasticsearch.
If elasticsearch-hadoop is configured for Kerberos authentication and Hadoop security is enabled, elasticsearch-hadoop’s storage functions in Pig will automatically obtain delegation tokens for jobs when submitting them to the cluster.
Using Kerberos authentication in elasticsearch-hadoop for Spark has the following requirements:
- Your Spark jobs must be deployed on YARN. Using Kerberos authentication in elasticsearch-hadoop does not support any other Spark cluster deployments (Mesos, Standalone).
- Your version of Spark must be 2.1.0 or above. This is the version in which Spark added the ability to plug in third-party credential providers to obtain delegation tokens.
Before using Kerberos authentication to Elasticsearch in Spark, Kerberos authentication must be enabled for Hadoop. Make sure you have done all the required steps for configuring your Hadoop cluster as well as the steps for configuring your YARN services before using Kerberos authentication for Elasticsearch.
Before Spark submits an application to a YARN cluster, it loads a number of credential provider implementations that are used to determine if any additional credentials must be obtained before the application is started. These implementations are loaded using Java’s ServiceLoader architecture. Thus, any jar that is on the classpath when the Spark application is submitted can offer implementations to be loaded and used.
EsServiceCredentialProvider is one such implementation that is loaded whenever elasticsearch-hadoop is on the job’s classpath. EsServiceCredentialProvider determines if Kerberos authentication is enabled for elasticsearch-hadoop. If it is, then the credential provider will automatically obtain delegation tokens from Elasticsearch and add them to the credentials on the YARN application submission context. Additionally, if the job is a long lived process like a Spark Streaming job, the credential provider is used to update or obtain new delegation tokens when the current tokens approach their expiration time.
The time at which Spark’s credential providers are loaded and called depends on the cluster deploy mode when submitting your Spark app. When running in client deploy mode, Spark runs the user’s driver code in the local JVM, and launches the YARN application to oversee the processing as needed. The providers are loaded and run whenever the YARN application first comes online. When running in cluster deploy mode, Spark launches the YARN application immediately, and the user’s driver code is run from the resulting Application Master in YARN. The providers are loaded and run immediately, before any user code is executed.
All implementations of the Spark credential providers use settings from only a few places:
- The entries from the local Hadoop configuration files
- The entries of the local Spark configuration file
- The entries that are specified from the command line when the job is initially launched
Settings that are configured from the user code are not used because the provider must run once for all jobs that are submitted for a particular Spark application. User code is not guaranteed to be run before the provider is loaded. To make things more complicated, a credential provider is only given the local Hadoop configuration to determine whether it should load delegation tokens.
These limitations mean that the settings to configure elasticsearch-hadoop for Kerberos authentication need to be in specific places:
First, es.security.authentication MUST be set in the local Hadoop configuration files as kerberos. If it is not set in the Hadoop configurations, then the credential provider will assume that simple authentication is to be used, and will not obtain delegation tokens.
Secondly, all general connection settings for elasticsearch-hadoop (like es.nodes, es.net.ssl, etc.) must be specified either in the local Hadoop configuration files, in the local Spark configuration file, or from the command line. If these settings are not available here, then the credential provider will not be able to contact Elasticsearch in order to obtain the delegation tokens that it requires.
$> bin/spark-submit \
    --class org.myproject.MyClass \
    --master yarn \
    --deploy-mode cluster \
    --jars path/to/elasticsearch-hadoop.jar \
    --conf 'spark.es.nodes=es-node-1,es-node-2,es-node-3' \
    --conf 'spark.es.net.ssl=true' \
    --conf 'spark.es.net.spnego.auth.elasticsearch.principal=HTTP/_HOST@REALM' \
    path/to/jar.jar
An example of some connection settings specified at submission time
Be sure to include the Elasticsearch service principal.
Specifying this many configurations on the spark-submit command line makes it easy to miss important settings. Thus, it is advised to set them in the cluster-wide Hadoop configuration.
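If settings do have to be passed at submission time, generating the flags from a single map can reduce the chance of dropping one. The sketch below turns elasticsearch-hadoop settings into --conf arguments using the spark. prefix seen in the command above. The class and method names are hypothetical, and the sketch only builds the argument list without launching anything.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: turn es.* settings into spark-submit --conf arguments. Names are hypothetical. */
final class SparkSubmitConf {
    static List<String> confArgs(Map<String, String> esSettings) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : esSettings.entrySet()) {
            // Settings passed through Spark configuration carry the "spark." prefix.
            String key = entry.getKey().startsWith("spark.")
                    ? entry.getKey()
                    : "spark." + entry.getKey();
            args.add("--conf");
            args.add(key + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("es.nodes", "es-node-1,es-node-2,es-node-3");
        settings.put("es.net.spnego.auth.elasticsearch.principal", "HTTP/_HOST@REALM");
        System.out.println(String.join(" ", confArgs(settings)));
    }
}
```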
In the event that you are running a streaming job, it is best to use the cluster deploy mode to allow YARN to manage running the driver code for the streaming application.
Since streaming jobs are expected to run continuously without stopping, you should configure Spark so that the credential provider can obtain new tokens before the original tokens expire.
Configuring Spark to obtain new tokens is different from configuring YARN to renew and cancel tokens. YARN can only renew existing tokens up to their maximum lifetime. Tokens from Elasticsearch are not renewable. Instead, they have a simple lifetime of 7 days. After those 7 days elapse, the tokens are expired. In order for an ongoing streaming job to continue running without interruption, completely new tokens must be obtained and sent to worker tasks. Spark has facilities for automatically obtaining and distributing completely new tokens once the original token lifetime has ended.
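The timing involved can be sketched with java.time. The 7 day lifetime comes from the text above. The idea of requesting a replacement after a fixed fraction of the lifetime, and the 0.75 ratio used in the example, are assumptions for illustration and are not taken from Spark's or elasticsearch-hadoop's configuration. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch: when a non-renewable 7 day token should be replaced. Names are hypothetical. */
final class TokenLifetime {
    static final Duration LIFETIME = Duration.ofDays(7);

    /** The instant after which the token is no longer valid. */
    static Instant expiresAt(Instant issuedAt) {
        return issuedAt.plus(LIFETIME);
    }

    /** The instant at which a replacement should be requested, as a fraction of the lifetime. */
    static Instant refreshAt(Instant issuedAt, double ratio) {
        if (ratio <= 0 || ratio > 1) {
            throw new IllegalArgumentException("ratio must be in (0, 1]");
        }
        long millis = (long) (LIFETIME.toMillis() * ratio);
        return issuedAt.plusMillis(millis);
    }

    /** True once the refresh point has been reached. */
    static boolean needsNewToken(Instant issuedAt, Instant now, double ratio) {
        return !now.isBefore(refreshAt(issuedAt, ratio));
    }

    public static void main(String[] args) {
        Instant issued = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println("expires: " + expiresAt(issued));
        System.out.println("refresh: " + refreshAt(issued, 0.75)); // hypothetical ratio
    }
}
```

Because the tokens cannot be renewed, the refresh step always means obtaining a completely new token, which requires credentials that are still valid at that time.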
When submitting a Spark application on YARN, users can provide a principal and keytab file to the spark-submit command. Spark will log in with these credentials instead of depending on the local Kerberos TGT cache for the current user. If any delegation tokens are close to expiring, the loaded credential providers are given the chance to obtain new tokens using the given principal and keytab before the current tokens fully expire. Any new tokens are automatically distributed by Spark to the containers on the YARN cluster.
When elasticsearch-hadoop is on the classpath, EsServiceCredentialProvider is ALWAYS loaded by Spark. If Kerberos authentication is enabled for elasticsearch-hadoop in the local Hadoop configuration, then the provider will attempt to load delegation tokens for Elasticsearch regardless of whether they are needed for that particular job.
It is advised that you do not add elasticsearch-hadoop libraries to jobs that are not configured to connect to or interact with Elasticsearch. This is the easiest way to avoid the confusion of unrelated jobs failing to launch because they cannot connect to Elasticsearch.
If you find yourself in a place where you cannot easily remove elasticsearch-hadoop from the classpath of jobs that do not need to interact with Elasticsearch, then you can explicitly disable the credential provider by setting a property at launch time. The property to set is dependent on your version of Spark:
For Spark 2.3.0 and up: set the
For Spark 2.1.0-2.3.0: set the
false. This property is still accepted in Spark 2.3.0+, but is marked as deprecated.
Before using Kerberos authentication to Elasticsearch in Cascading, Kerberos authentication must be enabled for Hadoop. Make sure you have done all the required steps for configuring your Hadoop cluster as well as the steps for configuring your YARN services before using Kerberos authentication for Elasticsearch.
The ES-Hadoop Cascading integration only supports Kerberos authentication using the Hadoop flow connector.
When running in Hadoop mode, before launching your job, you must add a delegation token for Elasticsearch to the currently logged in user’s credential set. When launching the job, ensure that the process can obtain your Kerberos credentials, either by being logged in with kinit, or by performing the login manually using Hadoop’s UserGroupInformation class. Once you are logged in, simply use the provided EsTap.initCredentials(Properties) method to obtain a token and cache it on the current user.
- Initialize the current user with an Elasticsearch delegation token.
- Set up your process as usual.
- Use the HadoopFlowConnector for executing the job.
- Run the application once it has been configured.
The EsTap.initCredentials method uses the available properties to connect to Elasticsearch using Kerberos, obtain delegation tokens, and store them on the current user. Obtaining credentials should ideally be done on a per job basis to avoid issues with token cancellation.
Your Storm deployment should be secured, but configuring it for security is not strictly required.
Storm is not always deployed alongside a Hadoop distribution. Thus, configuring Kerberos authentication for Hadoop is not required for using Kerberos authentication to Elasticsearch on Storm.
Storm provides a myriad of plugin interfaces that can be loaded and used to collect, update, and renew credentials over the lifetime of a running topology. elasticsearch-hadoop provides the AutoElasticsearch class which Storm can use to automatically obtain and renew Elasticsearch delegation tokens for a topology.
AutoElasticsearch implements three of Storm’s credential interfaces. The first is used to obtain delegation tokens on Nimbus before submitting a topology, the second is used for updating the credentials on the worker nodes, and the third is used for obtaining new delegation tokens when the current tokens are close to expiring.
In order for the AutoElasticsearch plugin to obtain credentials, Kerberos authentication must be enabled for elasticsearch-hadoop in its settings. You must specify the es.security.authentication setting in either the storm.yaml file or in the topology configuration.
The AutoElasticsearch plugin provides two settings for denoting the principal and keytab to be used when executing:
- Required. The principal that the plugin should use for obtaining credentials for this topology. Can be set in the storm.yaml configuration or in the topology configuration.
- Required. The path to the keytab on Nimbus that will be used for logging in as the given principal. This can be set in the storm.yaml configuration or in the topology configuration. The file must exist on Nimbus.
Nimbus must be configured to use AutoElasticsearch as a credential plugin from the storm.yaml configuration file.
It is safe to specify AutoElasticsearch in these settings even if your topology does not interact with Elasticsearch. The plugin will perform no operations unless AutoElasticsearch is explicitly enabled on the topology.
The list of auto credential plugins to be run on Nimbus when submitting a topology
The list of all the credential renewers available for Nimbus to run
The frequency at which the credential renewers on Nimbus should be executed to check and update credentials.
In order for the plugin to be loaded, elasticsearch-hadoop must be present on the Nimbus classpath. You can add it to the classpath by using an environment variable on Nimbus.
Once Nimbus is configured, you must add AutoElasticsearch to your topology configuration in order for delegation tokens to be obtained and updated. If you do not specify it in the topology configuration, then Storm will not attempt to obtain Elasticsearch delegation tokens when the topology is submitted.
Config conf = new Config();
// Register AutoElasticsearch as an auto-credential plugin for this topology
List<String> plugins = new ArrayList<>();
plugins.add(AutoElasticsearch.class.getName());
conf.put(Config.TOPOLOGY_AUTO_CREDENTIALS, plugins);
...
// Kerberos settings for elasticsearch-hadoop
conf.put(ConfigurationOptions.ES_SECURITY_AUTHENTICATION, "kerberos");
conf.put(ConfigurationOptions.ES_NET_SPNEGO_AUTH_ELASTICSEARCH_PRINCIPAL, "HTTP/elasticsearch.node.address@REALM");
...
Configure the topology with the AutoElasticsearch credential plugin.
If you have not enabled Kerberos authentication for elasticsearch-hadoop in the storm.yaml configuration file, you will need to set the properties here.