02 October 2013

Elasticsearch Internals: an Overview

By Njal Karevoll

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

This article gives an overview of the Elasticsearch internals. I will present a 10,000-foot view of the different modules that Elasticsearch is composed of and show how we can extend or replace built-in functionality using plugins.

What is a Module?

A module in Elasticsearch is a Guice Module component that contributes configuration information and binds a specific implementation of the various interfaces of Elasticsearch. A standard Elasticsearch server without any plugins currently consists of over 100 modules:

$ find src/main -name \*Module\* | grep -v common | wc -l
     101

During startup, Elasticsearch collects different modules based on its configuration and runtime environment and creates what is called an Injector. To put it simply, an Injector is an object that can construct instances of classes without us providing the parameters required for their construction. The Injector will instead use its configured modules to locate all required dependencies and create them in a topologically sorted order for us. This both saves us a lot of time and assists us in creating a composable, modular system. And a system the size of Elasticsearch needs to be just that in order to be maintainable over time.

In more technical terms, this is known as Dependency Injection, which James Shore sums up succinctly as:

Dependency injection means giving an object its instance variables
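To make the Injector idea described above a little more concrete, here is a minimal sketch of a toy injector in plain Java. It is not Guice and not how Elasticsearch is implemented; ToyInjector, Clock, FixedClock and Greeter are hypothetical names invented for this illustration. It assumes every class has exactly one constructor and treats every instance as a singleton.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Toy injector for illustration only. This is NOT Guice.
final class ToyInjector {
    // Interface-to-implementation bindings, as a module would contribute them.
    private final Map<Class<?>, Class<?>> bindings = new HashMap<>();
    // Assumption of this sketch: every type is a singleton.
    private final Map<Class<?>, Object> instances = new HashMap<>();

    <T> ToyInjector bind(Class<T> type, Class<? extends T> impl) {
        bindings.put(type, impl);
        return this;
    }

    <T> T getInstance(Class<T> type) {
        Object existing = instances.get(type);
        if (existing != null) {
            return type.cast(existing);
        }
        // Unbound types are constructed as themselves.
        Class<?> impl = bindings.getOrDefault(type, type);
        try {
            // Assumption of this sketch: exactly one constructor per class.
            Constructor<?> ctor = impl.getDeclaredConstructors()[0];
            ctor.setAccessible(true);
            Class<?>[] paramTypes = ctor.getParameterTypes();
            Object[] args = new Object[paramTypes.length];
            for (int i = 0; i < paramTypes.length; i++) {
                // Dependencies are created before the object that needs them.
                args[i] = getInstance(paramTypes[i]);
            }
            Object created = ctor.newInstance(args);
            instances.put(type, created);
            return type.cast(created);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot construct " + impl.getName(), e);
        }
    }
}

// Hypothetical example types.
interface Clock {
    long now();
}

final class FixedClock implements Clock {
    public long now() {
        return 42L;
    }
}

final class Greeter {
    private final Clock clock;

    // The Greeter never decides which Clock it gets; the injector does.
    Greeter(Clock clock) {
        this.clock = clock;
    }

    String greet() {
        return "time=" + clock.now();
    }
}

final class ToyInjectorDemo {
    public static void main(String[] args) {
        ToyInjector injector = new ToyInjector().bind(Clock.class, FixedClock.class);
        // We ask for a Greeter without supplying its constructor arguments.
        System.out.println(injector.getInstance(Greeter.class).greet());
    }
}
```

The recursion in getInstance is what gives the dependencies-first construction order; a real injector such as Guice adds scopes, cycle detection and much better error reporting on top of this basic idea.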

In the rest of this post we’ll see how this makes Elasticsearch extremely pluggable and extendable. We start by taking a quick look at the HTTP side of Elasticsearch to see how Guice and dependency injection work in action.

HTTP Server Modules

For example, Elasticsearch comes with an HTTP server. To avoid coupling every component that works with HTTP operations to one particular server implementation, it has been abstracted into an HttpServer class, which during construction receives an HttpServerTransport that is by default bound to a concrete implementation called NettyHttpServerTransport. In other words, the HttpServer has no idea how requests and responses are received and sent; it only has to deal with the application logic (i.e. finding an appropriate request handler for incoming requests).
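The separation described above can be sketched roughly as follows. The types here (SketchTransport, InMemoryTransport, SketchHttpServer) are simplified, hypothetical stand-ins written for this illustration; the real HttpServer and HttpServerTransport classes in Elasticsearch have different and much larger APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, simplified transport abstraction: it only knows how to
// hand incoming request paths to whatever dispatcher it was given.
interface SketchTransport {
    void setDispatcher(Function<String, String> dispatcher);
}

// Hypothetical transport that never touches the network, standing in for
// a concrete implementation such as a Netty- or Jetty-based one.
final class InMemoryTransport implements SketchTransport {
    private Function<String, String> dispatcher;

    public void setDispatcher(Function<String, String> dispatcher) {
        this.dispatcher = dispatcher;
    }

    // Pretend a request for the given path just arrived.
    String simulateRequest(String path) {
        return dispatcher.apply(path);
    }
}

// The server side only contains application logic: finding a handler.
final class SketchHttpServer {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    // The transport is handed in from outside; the server never picks one.
    SketchHttpServer(SketchTransport transport) {
        transport.setDispatcher(this::dispatch);
    }

    void registerHandler(String path, Function<String, String> handler) {
        handlers.put(path, handler);
    }

    private String dispatch(String path) {
        Function<String, String> handler = handlers.get(path);
        return handler == null ? "404 no handler for " + path : handler.apply(path);
    }
}

final class SketchHttpServerDemo {
    public static void main(String[] args) {
        InMemoryTransport transport = new InMemoryTransport();
        SketchHttpServer server = new SketchHttpServer(transport);
        server.registerHandler("/_status", path -> "200 OK");
        System.out.println(transport.simulateRequest("/_status"));
    }
}
```

Because SketchHttpServer only depends on the interface, replacing InMemoryTransport with another implementation requires no change to the server class itself.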

If, down the line, Elasticsearch should want to switch HTTP layers for any number of reasons, such as performance, security or features, it’s as simple as swapping out a configuration value (in this case http.type). As a matter of fact, Sonian has released an HTTP layer for Elasticsearch based on Jetty, which does exactly this. It’s available here. The Jetty-based implementation supports additional features like authentication, which might be a requirement in some deployments.

Overview with Namespaces

Here’s a bird’s-eye view of modules and their namespaces:

Modules overview

In this overview, multiple modules are put into different namespaces according to their location in the source code tree or usage in the code. Different aspects of Elasticsearch, such as the REST interface, the plugins, rivers and transport service are separate entities with no module level dependencies between them. Modules that otherwise would have been too large or non-composable are further separated into smaller modules in nested namespaces.

All Modules

The full overview of all modules is rather large. Click on the thumbnail below to see it in full size.

Legend: Namespaces – Module names – Bound classes

All modules

Spawning New Modules

Most modules only provide bindings for one or more classes or interfaces, but some modules spawn new modules when they’re used. These modules are usually conditionally spawned depending on the current Settings object, which is read from the elasticsearch.yml configuration file during start-up. This enables plugin authors to write plugins that replace or extend built-in functionality in Elasticsearch that can be enabled and configured via the default configuration system.

One concrete example of this is the Discovery Module, which by default uses either the Zen or Local discovery modules, but may have its implementation swapped by setting the discovery.type configuration value to the canonical name of another module. This is exactly how Sonian’s ZooKeeper Discovery is used.
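As a rough sketch of this kind of settings-driven selection: the code below picks an implementation from a discovery.type value, falling back to treating the value as a fully qualified class name. All types here (SketchDiscovery, DiscoverySelector and so on) are hypothetical, the settings are modelled as a plain Map, and defaulting to "zen" when the key is absent is an assumption of this sketch. The real DiscoveryModule works with Guice modules and the Elasticsearch Settings class instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a discovery implementation.
interface SketchDiscovery {
    String describe();
}

final class ZenLikeDiscovery implements SketchDiscovery {
    public String describe() {
        return "zen";
    }
}

final class LocalLikeDiscovery implements SketchDiscovery {
    public String describe() {
        return "local";
    }
}

// Hypothetical implementation that a plugin might ship.
final class CustomDiscovery implements SketchDiscovery {
    public String describe() {
        return "custom";
    }
}

final class DiscoverySelector {
    // Built-in short names.
    private final Map<String, Supplier<SketchDiscovery>> builtIn = new HashMap<>();

    DiscoverySelector() {
        builtIn.put("zen", ZenLikeDiscovery::new);
        builtIn.put("local", LocalLikeDiscovery::new);
    }

    SketchDiscovery select(Map<String, String> settings) {
        // Assumption of this sketch: "zen" is used when nothing is configured.
        String type = settings.getOrDefault("discovery.type", "zen");
        Supplier<SketchDiscovery> supplier = builtIn.get(type);
        if (supplier != null) {
            return supplier.get();
        }
        // Otherwise treat the value as the name of a class to load.
        try {
            return Class.forName(type)
                    .asSubclass(SketchDiscovery.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Unknown discovery.type: " + type, e);
        }
    }
}

final class DiscoverySelectorDemo {
    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("discovery.type", CustomDiscovery.class.getName());
        System.out.println(new DiscoverySelector().select(settings).describe());
    }
}
```

The point of the design is that the rest of the system only ever sees the interface, so a plugin can supply a new implementation and have it activated purely through configuration.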

Extending and Swapping Implementations

While it’s a rather large task to completely swap out entire implementations of components in Elasticsearch, these modules provide us with an easy way of adding functionality to Elasticsearch. Since all classes instantiated by the module system in Elasticsearch may request a reference to any other class instantiated by the same system, it’s almost trivial to extend the functionality via well-defined interfaces.

One of the most important modules for many developers working with Elasticsearch and Lucene is the AnalysisModule. The Analysis module is responsible for providing the analyzers, tokenizers, character filters and token filters used by Elasticsearch during indexing and searching. Using the API provided by the Analysis module, a plugin can add any Lucene-compatible analyzer, tokenizer or filter to Elasticsearch with just a few lines of code.

Resources

The overviews were created with FreeMind, and the source files can be downloaded here.