Elasticsearch Internals: an Overview
UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.
This article gives an overview of the Elasticsearch internals. I will present a 10,000-foot view of the different modules that Elasticsearch is composed of, and show how we can extend or replace built-in functionality using plugins.
A module in Elasticsearch is a Guice Module component that contributes configuration information and binds a specific implementation of the various interfaces of Elasticsearch. A standard Elasticsearch server without any plugins currently consists of over 100 modules:
$ find src/main -name \*Module\* | grep -v common | wc -l
101
During startup, Elasticsearch collects different modules based on its configuration and runtime environment and creates what is called an Injector. To put it simply, an Injector is an object that can construct instances of classes without us providing the parameters required for their construction. The Injector will instead use its configured modules to locate all required dependencies and create them in a topologically sorted order for us. This both saves us a lot of time and assists us in creating a composable, modular system. And a system the size of Elasticsearch needs to be just that in order to be maintainable over time.
Dependency injection means giving an object its instance variables
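To make the idea concrete, here is a minimal sketch of what an injector does, written in plain Java with reflection. It is not Guice and not Elasticsearch code: TinyInjector, Clock, FixedClock, AuditLog and ReportService are all hypothetical names invented for this illustration, and the sketch assumes every class has exactly one constructor and that there are no circular dependencies.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Hypothetical example types; none of these exist in Elasticsearch or Guice.
interface Clock {
    long now();
}

class FixedClock implements Clock {
    public long now() { return 42L; }
}

class AuditLog {
    final Clock clock;
    AuditLog(Clock clock) { this.clock = clock; }
}

class ReportService {
    final AuditLog log;
    final Clock clock;
    ReportService(AuditLog log, Clock clock) {
        this.log = log;
        this.clock = clock;
    }
}

// A toy injector: bindings map an interface to an implementation, and
// instances are created once and then shared (singleton scope only).
class TinyInjector {
    private final Map<Class<?>, Class<?>> bindings = new HashMap<>();
    private final Map<Class<?>, Object> singletons = new HashMap<>();

    <T> TinyInjector bind(Class<T> type, Class<? extends T> implementation) {
        bindings.put(type, implementation);
        return this;
    }

    <T> T getInstance(Class<T> type) {
        Object existing = singletons.get(type);
        if (existing != null) {
            return type.cast(existing);
        }
        Class<?> implementation = bindings.getOrDefault(type, type);
        if (implementation.isInterface()) {
            throw new IllegalStateException("No binding for " + type.getName());
        }
        try {
            // Assumption: the class declares exactly one constructor.
            Constructor<?> constructor = implementation.getDeclaredConstructors()[0];
            constructor.setAccessible(true);
            Class<?>[] parameterTypes = constructor.getParameterTypes();
            Object[] arguments = new Object[parameterTypes.length];
            // Dependencies are resolved recursively, so each one is fully
            // constructed before the object that needs it.
            for (int i = 0; i < parameterTypes.length; i++) {
                arguments[i] = getInstance(parameterTypes[i]);
            }
            T instance = type.cast(constructor.newInstance(arguments));
            singletons.put(type, instance);
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot construct " + implementation.getName(), e);
        }
    }
}

class TinyInjectorDemo {
    public static void main(String[] args) {
        TinyInjector injector = new TinyInjector().bind(Clock.class, FixedClock.class);
        // We never call "new ReportService(...)" ourselves.
        ReportService service = injector.getInstance(ReportService.class);
        System.out.println(service.clock == service.log.clock);
    }
}
```

The recursive lookup is what gives us the dependency-first construction order described above. A real injector such as Guice adds a great deal on top of this, including scopes, providers and cycle detection.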
In the rest of this post we’ll see how this makes Elasticsearch extremely pluggable and extensible. We start by taking a quick look at the HTTP side of Elasticsearch to see how Guice and dependency injection work in action.
For example, Elasticsearch comes with an HTTP server. To avoid coupling every component that works with HTTP operations to one particular server implementation, it has been abstracted into an HttpServer class, which during construction gets an HttpServerTransport that is by default bound to a concrete implementation called NettyHttpServerTransport. In other words, the HttpServer has no idea how the requests or responses are received and sent. It only has to deal with the application logic (i.e., finding an appropriate request handler for incoming requests).
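The separation between application logic and transport can be sketched in a few lines of plain Java. This is a simplified illustration, not the actual Elasticsearch classes: SketchHttpServer, SketchTransport and InMemoryTransport are hypothetical stand-ins, and requests and responses are reduced to plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the transport abstraction: it only knows how to
// hand incoming requests to a dispatcher, not what the dispatcher does.
interface SketchTransport {
    void start(Function<String, String> dispatcher);
}

// A fake transport that "receives" requests from method calls instead of a socket.
class InMemoryTransport implements SketchTransport {
    private Function<String, String> dispatcher;

    public void start(Function<String, String> dispatcher) {
        this.dispatcher = dispatcher;
    }

    String simulateRequest(String path) {
        return dispatcher.apply(path);
    }
}

// Hypothetical stand-in for the server: it receives its transport through the
// constructor and only contains the routing logic.
class SketchHttpServer {
    private final SketchTransport transport;
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    SketchHttpServer(SketchTransport transport) {
        this.transport = transport;
    }

    void registerHandler(String path, Function<String, String> handler) {
        handlers.put(path, handler);
    }

    void start() {
        transport.start(this::dispatch);
    }

    private String dispatch(String path) {
        Function<String, String> handler = handlers.get(path);
        return handler == null ? "404 no handler for " + path : handler.apply(path);
    }
}

class SketchHttpServerDemo {
    public static void main(String[] args) {
        InMemoryTransport transport = new InMemoryTransport();
        SketchHttpServer server = new SketchHttpServer(transport);
        server.registerHandler("/_ping", path -> "200 pong");
        server.start();
        System.out.println(transport.simulateRequest("/_ping"));
    }
}
```

Because the server only sees the interface, a transport backed by a different networking library could be passed to the same constructor without touching the routing code.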
If, down the line, Elasticsearch should want to switch HTTP layers for reasons such as performance, security or features, it’s as simple as swapping out a configuration value (in this case http.type). As a matter of fact, Sonian has released an HTTP layer for Elasticsearch based on Jetty, which does exactly this. It’s available here. The Jetty-based implementation supports additional features like authentication, which might be a requirement in some deployments.
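As a rough sketch of what "swapping out a configuration value" amounts to, the following plain Java example picks an implementation based on a settings map. Only the http.type key comes from the text above. The short values "netty" and "jetty", along with TransportSelector, PluggableTransport and the two transport classes, are assumptions made for this illustration and do not reflect the values Elasticsearch actually accepts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for an HTTP transport; not an Elasticsearch interface.
interface PluggableTransport {
    String describe();
}

class NettyStyleTransport implements PluggableTransport {
    public String describe() { return "netty-style transport"; }
}

class JettyStyleTransport implements PluggableTransport {
    public String describe() { return "jetty-style transport"; }
}

// Maps a configuration value to a factory for the matching implementation.
class TransportSelector {
    private final Map<String, Supplier<PluggableTransport>> factories = new HashMap<>();

    TransportSelector register(String type, Supplier<PluggableTransport> factory) {
        factories.put(type, factory);
        return this;
    }

    PluggableTransport select(Map<String, String> settings, String defaultType) {
        // Fall back to the default when the setting is absent.
        String type = settings.getOrDefault("http.type", defaultType);
        Supplier<PluggableTransport> factory = factories.get(type);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown http.type: " + type);
        }
        return factory.get();
    }
}

class TransportSelectorDemo {
    public static void main(String[] args) {
        TransportSelector selector = new TransportSelector()
                .register("netty", NettyStyleTransport::new)
                .register("jetty", JettyStyleTransport::new);

        Map<String, String> settings = new HashMap<>();
        settings.put("http.type", "jetty"); // assumed value, for illustration only
        System.out.println(selector.select(settings, "netty").describe());
    }
}
```

The rest of the system keeps asking for a PluggableTransport and never learns which one it received.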
Here’s a bird’s-eye view of modules and their namespaces:
In this overview, multiple modules are put into different namespaces according to their location in the source code tree or usage in the code. Different aspects of Elasticsearch, such as the REST interface, the plugins, rivers and transport service are separate entities with no module level dependencies between them. Modules that otherwise would have been too large or non-composable are further separated into smaller modules in nested namespaces.
The full overview of all modules is rather large.
Legend: Namespaces – Module names – Bound classes
Most modules only provide bindings for one or more classes or interfaces, but some modules spawn new modules when they’re used. These modules are usually spawned conditionally, depending on the current Settings object, which is read from the elasticsearch.yml configuration file during start-up. This enables plugin authors to write plugins that replace or extend built-in functionality in Elasticsearch, and that can be enabled and configured via the default configuration system.
One concrete example of this is the Discovery Module, which by default uses either the Zen or Local discovery modules, but may have its implementation swapped by setting the discovery.type configuration value to the canonical name of another module. This is exactly how Sonian’s ZooKeeper Discovery is used.
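The spawning mechanism can be sketched as a module that reads a class name from the settings and instantiates it reflectively. Again, this is plain Java for illustration only: SketchModule, DiscoverySketchModule, ZenLikeDiscoveryModule, ZooKeeperLikeDiscoveryModule and ModuleCollector are hypothetical names, and bindings are represented as descriptive strings rather than real Guice bindings. Only the discovery.type key is taken from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical module abstraction: a module records its bindings and may
// spawn further modules depending on the settings.
interface SketchModule {
    void configure(List<String> bindings);

    default List<SketchModule> spawnModules(Map<String, String> settings) {
        return Collections.emptyList();
    }
}

class ZenLikeDiscoveryModule implements SketchModule {
    public void configure(List<String> bindings) {
        bindings.add("Discovery -> ZenLikeDiscovery");
    }
}

class ZooKeeperLikeDiscoveryModule implements SketchModule {
    public void configure(List<String> bindings) {
        bindings.add("Discovery -> ZooKeeperLikeDiscovery");
    }
}

class DiscoverySketchModule implements SketchModule {
    public void configure(List<String> bindings) {
        bindings.add("DiscoveryService");
    }

    public List<SketchModule> spawnModules(Map<String, String> settings) {
        // The setting holds the class name of the module to spawn.
        String className = settings.getOrDefault(
                "discovery.type", ZenLikeDiscoveryModule.class.getName());
        try {
            SketchModule module = Class.forName(className)
                    .asSubclass(SketchModule.class)
                    .getDeclaredConstructor()
                    .newInstance();
            return Collections.singletonList(module);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load module " + className, e);
        }
    }
}

class ModuleCollector {
    // Walks the module tree and gathers every binding it contributes.
    static List<String> collect(SketchModule root, Map<String, String> settings) {
        List<String> bindings = new ArrayList<>();
        collectInto(root, settings, bindings);
        return bindings;
    }

    private static void collectInto(SketchModule module, Map<String, String> settings,
                                    List<String> bindings) {
        module.configure(bindings);
        for (SketchModule child : module.spawnModules(settings)) {
            collectInto(child, settings, bindings);
        }
    }
}

class ModuleCollectorDemo {
    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("discovery.type", ZooKeeperLikeDiscoveryModule.class.getName());
        System.out.println(ModuleCollector.collect(new DiscoverySketchModule(), settings));
    }
}
```

Loading the module by class name is what lets a plugin supply an implementation the core has never heard of, as long as the class is on the classpath.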
While it’s a rather large task to completely swap out entire implementations of components in Elasticsearch, these modules provide us with an easy way of adding functionality to Elasticsearch. Since all classes instantiated by the module system in Elasticsearch may request a reference to any other class instantiated by the same system, it’s almost trivial to extend the functionality via well-defined interfaces.
One of the most important modules for many developers working with Elasticsearch and Lucene is the AnalysisModule. The Analysis module is responsible for providing the analyzers, tokenizers, character filters and token filters used by Elasticsearch during indexing and searching. Using the API provided by the Analysis module, it’s trivial to add any Lucene-compatible analyzer, tokenizer or filter from a plugin to Elasticsearch with just a few lines of code.
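The general shape of such an extension point can be sketched as a registry of named components that a plugin contributes to. The example below is plain Java and deliberately does not use the Elasticsearch or Lucene APIs: AnalysisRegistrySketch, ReverseFilterPluginSketch and the component names are hypothetical, and tokenizers and token filters are modelled as simple functions over strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical registry of analysis components, keyed by name.
class AnalysisRegistrySketch {
    private final Map<String, Function<String, List<String>>> tokenizers = new HashMap<>();
    private final Map<String, UnaryOperator<List<String>>> tokenFilters = new HashMap<>();

    // Registers a couple of "built-in" components for the example.
    static AnalysisRegistrySketch withBuiltIns() {
        AnalysisRegistrySketch registry = new AnalysisRegistrySketch();
        registry.addTokenizer("whitespace",
                text -> Arrays.asList(text.trim().split("\\s+")));
        registry.addTokenFilter("lowercase", tokens -> tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList()));
        return registry;
    }

    void addTokenizer(String name, Function<String, List<String>> tokenizer) {
        tokenizers.put(name, tokenizer);
    }

    void addTokenFilter(String name, UnaryOperator<List<String>> filter) {
        tokenFilters.put(name, filter);
    }

    // Runs the named tokenizer, then each named filter in order.
    List<String> analyze(String text, String tokenizerName, String... filterNames) {
        Function<String, List<String>> tokenizer = tokenizers.get(tokenizerName);
        if (tokenizer == null) {
            throw new IllegalArgumentException("Unknown tokenizer: " + tokenizerName);
        }
        List<String> tokens = tokenizer.apply(text);
        for (String filterName : filterNames) {
            UnaryOperator<List<String>> filter = tokenFilters.get(filterName);
            if (filter == null) {
                throw new IllegalArgumentException("Unknown token filter: " + filterName);
            }
            tokens = filter.apply(tokens);
        }
        return tokens;
    }
}

// A hypothetical plugin: its whole contribution is one registration call.
class ReverseFilterPluginSketch {
    static void register(AnalysisRegistrySketch registry) {
        registry.addTokenFilter("reverse_sketch", tokens -> tokens.stream()
                .map(token -> new StringBuilder(token).reverse().toString())
                .collect(Collectors.toList()));
    }
}

class AnalysisRegistryDemo {
    public static void main(String[] args) {
        AnalysisRegistrySketch registry = AnalysisRegistrySketch.withBuiltIns();
        ReverseFilterPluginSketch.register(registry);
        System.out.println(
                registry.analyze("Hello World", "whitespace", "lowercase", "reverse_sketch"));
    }
}
```

The plugin class is the part that corresponds to "a few lines of code". Everything else already exists in the host system, and the new filter becomes usable by name alongside the built-in ones.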