What is an Elasticsearch index?
The term index is quite overloaded in the tech world. If you asked most developers what an index is, they might tell you it commonly refers to a data structure in a relational database (RDBMS) that is associated with a table, which improves the speed of data retrieval operations.
But what is an Elasticsearch® index? An Elasticsearch index is a logical namespace that holds a collection of documents, where each document is a collection of fields — which, in turn, are key-value pairs that contain your data.
How is an Elasticsearch index different from a relational database?
An Elasticsearch index is not the same thing as an index in a relational database. Think of an Elasticsearch cluster as a database that can contain many indices, each of which you can think of as a table, and within each index, you have many documents.
- RDBMS => Databases => Tables => Columns/Rows
- Elasticsearch => Clusters => Indices => Shards => Documents with key-value pairs
While Elasticsearch stores JSON documents, what you input into the index is incredibly flexible. It’s a quick process to get up and running using the multitude of Integrations and Beats available. Or you can go a little further and define your own ETL processes using Ingest Pipelines or Logstash®, with the aid of their numerous processors and plugins.
Another departure from relational databases is that you can import data without any upfront schema definition. Dynamic mapping is a great way to get started quickly or to account for unexpected fields in documents. Then, once your data model has settled, you can switch to an explicit (fixed) mapping to improve performance.
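To make the idea of type inference concrete, here is a minimal Python sketch of how a dynamic mapping could pick a field type from a JSON value. It is a simplified model and not Elasticsearch's implementation: it ignores date and numeric detection inside strings, and the sample document is made up for illustration.

```python
def infer_field_type(value):
    """Return the field type a simplified dynamic mapping would pick for a JSON value."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"  # Elasticsearch also adds a keyword sub-field by default
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        for item in value:
            if item is not None:
                return infer_field_type(item)
    return None  # null values and empty arrays add no field


def infer_mapping(document):
    """Map each top-level field name to its inferred type, skipping untyped fields."""
    inferred = {name: infer_field_type(value) for name, value in document.items()}
    return {name: ftype for name, ftype in inferred.items() if ftype is not None}


# Made-up document, for illustration only.
sample = {"title": "Hamlet", "acts": 5, "rating": 4.8, "in_print": True, "notes": None}
print(infer_mapping(sample))
# {'title': 'text', 'acts': 'long', 'rating': 'float', 'in_print': 'boolean'}
```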
Runtime fields are another interesting feature, one that lets you apply schema on read. They can be added to the mapping of an existing index to derive a new field, or you can define a runtime field in the search request at query time. Think of them as computed values, produced by a script that can read the source of the document.
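As a rough illustration, the sketch below builds the JSON body of a search request that defines a runtime field at query time. The field names `year_of_birth` and `decade_of_birth` are hypothetical, and the script assumes `year_of_birth` is already mapped as a numeric field.

```python
import json

# Hypothetical field names: "year_of_birth" is assumed to be an existing numeric
# field, and "decade_of_birth" is the runtime field derived from it.
search_body = {
    "runtime_mappings": {
        "decade_of_birth": {
            "type": "long",
            "script": {
                # Painless script, evaluated per document when the query runs.
                # The size() check skips documents that have no value for the field.
                "source": (
                    "if (doc['year_of_birth'].size() != 0) "
                    "{ emit(doc['year_of_birth'].value / 10 * 10); }"
                )
            },
        }
    },
    "query": {"match_all": {}},
    "fields": ["decade_of_birth"],  # ask for the computed value in each hit
}

# This JSON would be sent as the body of a request to <index>/_search.
print(json.dumps(search_body, indent=2))
```

Because the field is computed when the query runs, nothing is reindexed; the trade-off is extra work per search.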
Ready to see the difference in action? Try it out for free today with a trial account on Elastic Cloud.
How data interacts with Elasticsearch’s user-friendly API
Elasticsearch provides a RESTful JSON-based API for interacting with document data. You can index, search, update, and delete documents by sending HTTP requests to the appropriate cluster endpoints. These CRUD-like operations can take place at an individual document level or at the index level itself. If you’d prefer, there are also language-specific client libraries you can use instead of direct REST.
The following example creates a document in an index called playwrights with an assigned document_id of 1. Notice we don’t need to create any schemas or upfront configuration; we simply insert our data.
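A minimal sketch of that request, built with Python's standard library. The host, port, and field names (`firstname`, `lastname`) are assumptions made for illustration; only the index name and document ID come from the text above. The request is built but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Assumption: an unsecured test cluster listening on localhost:9200.
# Field names are hypothetical.
document = {"firstname": "William", "lastname": "Shakespeare"}

req = urllib.request.Request(
    "http://localhost:9200/playwrights/_doc/1",
    data=json.dumps(document).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)

# To actually send it against a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```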
We can further add documents and fields as we like, which is not something you could do easily with a relational database.
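For instance, a second document could carry a field that the first one never had. The sketch below assumes the same local test cluster and uses hypothetical field names, including the new `year_of_birth`; the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical document: "year_of_birth" is a new field that no earlier document
# in the index needs to have.
document = {"firstname": "Samuel", "lastname": "Beckett", "year_of_birth": 1906}

req = urllib.request.Request(
    "http://localhost:9200/playwrights/_doc/2",  # assumed local test cluster
    data=json.dumps(document).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)
# Send with urllib.request.urlopen(req) when a cluster is available.
```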
Now we can query out all the documents using the search endpoint.
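A sketch of such a search using a `match_all` query, again assuming a local test cluster. POST is used here because it carries a request body cleanly from most HTTP clients; the request is built but not sent.

```python
import json
import urllib.request

# match_all returns every document in the index (paged, 10 hits by default).
query = {"query": {"match_all": {}}}

req = urllib.request.Request(
    "http://localhost:9200/playwrights/_search",  # assumed local test cluster
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)
# With a running cluster, the matching documents are under ["hits"]["hits"]:
# with urllib.request.urlopen(req) as resp:
#     for hit in json.load(resp)["hits"]["hits"]:
#         print(hit["_id"], hit["_source"])
```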
Or we can query for a specific year of birth.
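A sketch of a filtered search, assuming the documents carry a hypothetical numeric `year_of_birth` field and that a local test cluster is available. The request is built but not sent.

```python
import json
import urllib.request

# A term query matches documents whose field holds exactly this value.
# "year_of_birth" is a hypothetical field name used for illustration.
query = {"query": {"term": {"year_of_birth": 1906}}}

req = urllib.request.Request(
    "http://localhost:9200/playwrights/_search",  # assumed local test cluster
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)
# Send with urllib.request.urlopen(req) when a cluster is available.
```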
In addition to basic querying, Elasticsearch provides advanced search features like fuzzy matching, stemming, relevance scoring, highlighting, and tokenization, which breaks text down into smaller chunks called tokens. In most cases, these tokens are individual words, but there are many different tokenizers available.
Why is denormalized data key to faster data retrieval?
In relational databases, normalization is often applied to eliminate data redundancy and ensure data consistency. For example, you might have separate tables for customers, products, and orders.
In Elasticsearch, denormalization is a common practice. Instead of splitting data across multiple tables, you store all the relevant information in a single JSON document. An order document would contain the customer information and the product information, rather than the order document holding foreign keys referring to separate product and customer indices. This allows for faster and more efficient retrieval of data in Elasticsearch during search operations. As a general rule of thumb, storage can be cheaper than compute costs for joining data.
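The difference can be sketched in plain Python. All names and values below are made up for illustration: three relational-style "tables" linked by IDs are folded into one self-contained order document of the kind you would index.

```python
# Normalized, relational-style data: three "tables" linked by IDs (values are made up).
customers = {7: {"name": "Ada Example", "city": "London"}}
products = {
    101: {"title": "Hamlet (paperback)", "price": 9.99},
    102: {"title": "Waiting for Godot (paperback)", "price": 8.50},
}
order_row = {"order_id": 5001, "customer_id": 7, "product_ids": [101, 102]}


def denormalize(order, customers, products):
    """Embed the referenced customer and products so the order is self-contained."""
    return {
        "order_id": order["order_id"],
        "customer": dict(customers[order["customer_id"]]),
        "products": [dict(products[pid]) for pid in order["product_ids"]],
    }


order_document = denormalize(order_row, customers, products)
print(order_document)
```

The join happens once, at write time, so a search never has to look up a customer or product separately. The cost is duplicated data that must be updated in every order if, say, a product title changes.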
How does Elasticsearch ensure scalability in distributed systems?
Each index is identified by a unique name and is divided into one or more shards, which are smaller subsets of the index that allow for parallel processing and distributed storage across a cluster of Elasticsearch nodes. A shard is either a primary or a replica of a primary. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.
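Elasticsearch decides which primary shard holds a document by hashing a routing value, which defaults to the document ID. The sketch below shows the general idea only: `zlib.crc32` is a stand-in hash chosen for this example, and the real routing formula has additional terms.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (the document ID by default) to a primary shard number."""
    # crc32 is a stand-in hash for this sketch; Elasticsearch uses its own hash function.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


shards = 3
doc_ids = ["1", "2", "3", "4", "5", "6"]
placement = {doc_id: pick_shard(doc_id, shards) for doc_id in doc_ids}
print(placement)
```

Because the same ID always hashes to the same shard, any node can work out where a document lives without consulting a central lookup table.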
Adding more nodes into the cluster gives you more capacity for indexing and searching, something that’s not so easily achieved with a relational database.
Going back to our playwrights example from above, if we run the following, we can see the type mappings that Elasticsearch automatically inferred and the number of shards and replicas the index has assigned.
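A sketch of that request and of reading its response, assuming a local test cluster. The helper `summarize_index` is a hypothetical name introduced here; it relies on the response nesting the details under the index name, with `mappings` and `settings` keys. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: an unsecured test cluster on localhost:9200.
req = urllib.request.Request("http://localhost:9200/playwrights", method="GET")


def summarize_index(index_name, response):
    """Pull field types and shard settings out of a parsed GET /<index> response."""
    index_info = response[index_name]
    properties = index_info["mappings"].get("properties", {})
    settings = index_info["settings"]["index"]
    return {
        # Object fields carry "properties" instead of an explicit "type".
        "field_types": {name: spec.get("type", "object") for name, spec in properties.items()},
        "number_of_shards": settings["number_of_shards"],
        "number_of_replicas": settings["number_of_replicas"],
    }


print(req.get_method(), req.full_url)
# With a running cluster:
# with urllib.request.urlopen(req) as resp:
#     print(summarize_index("playwrights", json.load(resp)))
```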
What types of data can be indexed in Elasticsearch?
Elasticsearch can index many types of data: text first and foremost, but also numeric and geolocational data. It can also store dense vectors that are used in similarity searches. Let’s look at each of these in turn.
Inverted indices for text/lexical search
Elasticsearch chooses the best underlying data structure for each field type. For example, text is tokenized and then stored in an inverted index, a structure that lists every unique token that appears in any document and identifies all of the documents in which each token occurs.
The following table shows the general makeup of an inverted index. We can see that if we were to search for the term “London,” we find that it occurs in six different documents in the index. It’s this inverted index that allows us to perform textual queries very quickly.
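To show how such a structure is built, here is a minimal Python sketch of an inverted index. The tokenizer (lowercase, split on whitespace, strip basic punctuation) is a deliberate simplification of what Elasticsearch's analyzers do, and the three documents are made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace, stripping basic punctuation."""
    return [word.strip(".,!?;:\"'") for word in text.lower().split()]


def build_inverted_index(documents):
    """Map each token to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            if token:
                postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


# Made-up documents for illustration.
documents = {
    1: "The Globe Theatre is in London.",
    2: "Samuel Beckett lived in Paris.",
    3: "London has many theatres.",
}
index = build_inverted_index(documents)
print(index["london"])  # [1, 3]
print(index["in"])      # [1, 2]
```

A term lookup is a single dictionary access, no matter how many documents were indexed, which is why textual queries stay fast as the index grows.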
Numeric and geolocational search capabilities for efficient spatial analysis
Numeric and geolocational data are stored in BKD trees (block k-d trees), a data structure for efficient spatial indexing and querying of multidimensional data. A BKD tree organizes data points into blocks, which allows for fast range searches and nearest-neighbor queries in large data sets and makes it well suited to spatial data analysis.
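The benefit of grouping points into blocks can be sketched in one dimension. The code below is a simplified stand-in and not a BKD tree: it keeps a flat list of sorted blocks with min/max bounds, whereas a real BKD tree arranges blocks in a tree over several dimensions. The sample values are made up.

```python
def build_blocks(values, block_size):
    """Sort the values and pack them into fixed-size blocks with min/max bounds."""
    ordered = sorted(values)
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "values": chunk})
    return blocks


def range_search(blocks, low, high):
    """Return values in [low, high] and how many blocks had to be scanned."""
    hits, blocks_scanned = [], 0
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # the whole block lies outside the range, so skip it
        blocks_scanned += 1
        hits.extend(v for v in block["values"] if low <= v <= high)
    return hits, blocks_scanned


years = [1564, 1906, 1856, 1828, 1911, 1930, 1622, 1860]  # made-up sample values
blocks = build_blocks(years, block_size=3)
print(range_search(blocks, 1850, 1910))  # ([1856, 1860, 1906], 1)
```

Only one of the three blocks is scanned for this range; the other two are ruled out by their bounds alone.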
Vector/semantic search with NLP
You may have heard about vector search, but what is it? Vector search engines (also referred to as vector databases, semantic search, or cosine search) find the nearest neighbors to a given (vectorized) query. The power of vector search is that it discovers similar documents that are not an exact textual match, as would be required by our inverted index example above; it instead uses vectors that describe some level of similarity.
The natural language processing (NLP) community has developed a technique called text embedding that encodes words and sentences as numeric vectors. These vector representations are designed to capture the linguistic content of the text, and they can be used to assess the similarity between a query and a document.
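One common way to assess that similarity is the cosine of the angle between two vectors. The sketch below ranks documents by brute force; the three-dimensional vectors are hand-written toy values, not the output of any embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vector, doc_vectors, k):
    """Rank documents by similarity to the query and return the top k IDs."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: cosine_similarity(query_vector, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional "embeddings", hand-written for illustration (not model output).
doc_vectors = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.0, 1.0, 0.2],
    "doc-c": [0.8, 0.3, 0.1],
}
print(nearest([1.0, 0.2, 0.0], doc_vectors, k=2))  # ['doc-a', 'doc-c']
```

Comparing the query against every document works for a handful of vectors; at scale, search engines use approximate nearest-neighbor structures instead.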
Some common use cases for vector search are:
- Answering questions
- Finding answers to previously answered questions, where the question asked is similar but not exactly the same in a textual form
- Making recommendations — for example, a music application finding similar songs based on your preferences
All of these use cases rely on vectors with hundreds or even thousands of dimensions, which provide a rich representation of the data for accurate similarity assessment and targeted recommendations.
Elasticsearch supports vector search via the dense_vector field type and its ability to run similarity searches between the vectors stored in documents and the search term after it has been converted into a vector.
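As a rough sketch, the two JSON bodies below show a mapping with a `dense_vector` field and a k-nearest-neighbor search against it. The field names and the tiny three-dimensional vectors are hypothetical, and the top-level `knn` search option assumes a recent 8.x release.

```python
import json

# Hypothetical index layout: a "title" text field plus a 3-dimensional "embedding".
# Real embeddings have far more dimensions; "dims" must match the model's output size.
create_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 3,
                "index": True,           # build a structure for approximate kNN search
                "similarity": "cosine",  # how vectors are compared
            },
        }
    }
}

# The query text must be converted to a vector by the same model before searching.
knn_search_body = {
    "knn": {
        "field": "embedding",
        "query_vector": [1.0, 0.2, 0.0],
        "k": 2,                # number of nearest neighbors to return
        "num_candidates": 10,  # candidates examined per shard before picking k
    }
}

# The first body would go to PUT /<index>, the second to POST /<index>/_search.
print(json.dumps(create_index_body, indent=2))
print(json.dumps(knn_search_body, indent=2))
```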
For those who want to delve a little deeper into generative AI, we also offer ESRE, the Elasticsearch Relevance Engine™, which is designed to power artificial intelligence-based search applications. ESRE gives developers a full suite of sophisticated retrieval algorithms and the ability to integrate with large language models.
Try it out!
As you can see, Elasticsearch indices have come a long way since Elastic co-founder and Chief Technology Officer Shay Banon first wrote a recipe search engine for his wife. There’s a lot more to discover, and a great place to start is by creating a trial account on Elastic Cloud — you’ll be up and running in minutes. In addition, watch our on-demand webinar and get started with Elasticsearch.
Originally published February 24, 2013; updated July 17, 2023.