Preload Elasticsearch with your data set


Recently, I got a question on the discussion forum on how to modify the official Docker image to provide a ready-to-use Elasticsearch® cluster that contains data already.

Honestly, this is not ideal, because you'd have to hack the way the Elasticsearch service starts by providing a forked version of entrypoint.sh. That will make your life harder when it comes to maintenance and upgrades. Instead, I found it better to use other solutions to achieve the same goal.

Setting up the problem

To start with this idea, we will consider that we are using the Elasticsearch Docker image and following the documentation:

docker pull docker.elastic.co/elasticsearch/elasticsearch:8.7.0
docker network create elastic
docker run --name es01 --net elastic -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.7.0

Note that we are not mounting the data directory here, so the data for this cluster will be ephemeral and will disappear once the node shuts down. Once it starts, we can use the generated password to check that it's running well:

curl -s -k -u elastic:CHANGEME https://localhost:9200 | jq

This gives:

{
  "name": "697bf734a5d5",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "cMISiT__RSWkoKDYql1g4g",
  "version": {
    "number": "8.7.0",
    "build_flavor": "default",
    "build_type": "docker",
    "build_hash": "09520b59b6bc1057340b55750186466ea715e30e",
    "build_date": "2023-03-27T16:31:09.816451435Z",
    "build_snapshot": false,
    "lucene_version": "9.5.0",
    "minimum_wire_compatibility_version": "7.17.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "You Know, for Search"
}
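
By the way, if you missed the password that Elasticsearch generated and printed on its first start, you can reset the password of the elastic user from within the running container:

docker exec -it es01 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic

The tool prints a new random password that you can then use in place of CHANGEME in these examples.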

So, we want to have a data set already available. Let's take the sample data set I often use while demoing Elasticsearch: the person data set. I wrote a generator that produces some fake data.

First, let's download the injector:

wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/injector/injector/8.7/injector-8.7.jar

Then, we will generate our data set on disk using the following options:

mkdir data
java -jar injector-8.7.jar --console --silent > data/persons.json

We now have 1,000,000 JSON documents, and the data set should look like this:

head -2 data/persons.json

{"name":"Charlene Mickael","dateofbirth":"2000-11-01","gender":"female","children":3,"marketing":{"cars":1236,"shoes":null,"toys":null,"fashion":null,"music":null,"garden":null,"electronic":null,"hifi":1775,"food":null},"address":{"country":"Italy","zipcode":"80100","city":"Ischia","countrycode":"IT","location":{"lon":13.935138341699972,"lat":40.71842684204817}}}
{"name":"Kim Hania","dateofbirth":"1998-05-18","gender":"male","children":4,"marketing":{"cars":null,"shoes":null,"toys":132,"fashion":null,"music":null,"garden":null,"electronic":null,"hifi":null,"food":null},"address":{"country":"Germany","zipcode":"9998","city":"Berlin","countrycode":"DE","location":{"lon":13.164834451298645,"lat":52.604673827377155}}}

Using a shell script

Here, we have 1 million documents, so we cannot really send them as-is in a single bulk request. Instead, we need to:

  • Split the file into chunks of 10,000 index operations or fewer
  • For each document, add the missing bulk header
  • Send the documents using the _bulk API

I ended up writing this script, which requires that you have curl and jq installed:
#!/bin/bash
ELASTIC_PASSWORD=CHANGEME
mkdir -p tmp
echo "Split the source into chunks of 10000 documents"
split -d -l10000 ../data/persons.json tmp/part
BULK_REQUEST_FILE="tmp/bulk_request.ndjson"
FILES="tmp/part*"
for f in $FILES
do
  # Start from a clean bulk request file for each chunk
  rm -f $BULK_REQUEST_FILE
  echo "Preparing $f file..."
  while read -r p; do
    # Add the bulk action header before each document
    echo '{"index":{}}' >> $BULK_REQUEST_FILE
    echo "$p" >> $BULK_REQUEST_FILE
  done <$f
  echo "Calling Elasticsearch Bulk API"
  curl -XPOST -s -k -u elastic:$ELASTIC_PASSWORD https://localhost:9200/person/_bulk -H 'Content-Type: application/json' --data-binary "@$BULK_REQUEST_FILE" | jq '"Bulk executed in \(.took) ms with errors=\(.errors)"'
done

This basically prints:

Preparing tmp/part00 file...
Calling Elasticsearch Bulk API
"Bulk executed in 1673 ms with errors=false"
Preparing tmp/part01 file...
Calling Elasticsearch Bulk API
"Bulk executed in 712 ms with errors=false"
...
Preparing tmp/part99 file...
Calling Elasticsearch Bulk API
"Bulk executed in 366 ms with errors=false"

On my machine, it took more than eight minutes to run, and most of that time was spent writing the bulk request files. There's probably a lot of room for improvement, but I must confess that I'm not that good at writing shell scripts. Ha! You guessed that already, huh?
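
If speed matters, a possible improvement (a rough sketch I have not benchmarked) is to let awk build each bulk body in one pass instead of the line-by-line shell loop:

#!/bin/bash
ELASTIC_PASSWORD=CHANGEME
mkdir -p tmp
split -d -l10000 ../data/persons.json tmp/part
for f in tmp/part*
do
  # Prepend the bulk action line to every document in a single awk pass
  awk '{print "{\"index\":{}}"; print}' "$f" > tmp/bulk_request.ndjson
  echo "Calling Elasticsearch Bulk API for $f"
  curl -XPOST -s -k -u elastic:$ELASTIC_PASSWORD https://localhost:9200/person/_bulk -H 'Content-Type: application/json' --data-binary "@tmp/bulk_request.ndjson" | jq '"Bulk executed in \(.took) ms with errors=\(.errors)"'
done

The bulk payload is identical; only the way it is built changes.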

Using Logstash

Logstash® can do a job similar to what we have done manually, but it also provides many more features, such as error handling and monitoring. Plus, we don't even need to write any code. We will be using Docker again here:

docker pull docker.elastic.co/logstash/logstash:8.7.0

Let's write a pipeline for this job (pipeline/persons.conf):

input {
  file {
    path => "/usr/share/logstash/persons/persons.json"
    mode => "read"
    codec => json { }
    exit_after_read => true
  }
}
filter {
  mutate {
    remove_field => [ "log", "@timestamp", "event", "@version" ]
  }
}
output {
  elasticsearch {
    hosts => "${ELASTICSEARCH_URL}"
    index => "person"
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
    ssl_certificate_verification => false
  }
}

We can now run the job:

docker run --rm -it --name ls01 --net elastic \
  -v $(pwd)/../data/:/usr/share/logstash/persons/:ro \
  -v $(pwd)/pipeline/:/usr/share/logstash/pipeline/:ro \
  -e XPACK_MONITORING_ENABLED=false \
  -e ELASTICSEARCH_URL="https://es01:9200" \
  -e ELASTIC_PASSWORD="CHANGEME" \
  docker.elastic.co/logstash/logstash:8.7.0

On my machine, it took less than two minutes to run it.

Using docker compose

Instead of running everything manually, you can use the docker compose command to orchestrate it all and provide your users with a ready-to-use cluster. Here is a simple .env file:

ELASTIC_PASSWORD=CHANGEME
STACK_VERSION=8.7.0
ES_PORT=9200

And the docker-compose.yml:

version: "2.2"
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:${STACK_VERSION}
    ports:
      - ${ES_PORT}:9200
    environment:
      - node.name=es01
      - cluster.initial_master_nodes=es01
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - bootstrap.memory_lock=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
    healthcheck:
      test:
        [
          "CMD-SHELL",
          "curl -s -k https://localhost:9200 | grep -q 'missing authentication credentials'",
        ]
      interval: 10s
      timeout: 10s
      retries: 120
  logstash:
    depends_on:
      es01:
        condition: service_healthy
    image: docker.elastic.co/logstash/logstash:${STACK_VERSION}
    volumes:
      - type: bind
        source: ../data
        target: /usr/share/logstash/persons
        read_only: true
      - type: bind
        source: pipeline
        target: /usr/share/logstash/pipeline
        read_only: true
    environment:
      - ELASTICSEARCH_URL=https://es01:9200
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
      - XPACK_MONITORING_ENABLED=false

We still have our persons.json file in the ../data directory. It's mounted as /usr/share/logstash/persons/persons.json, just as in the previous example, and we are using the same pipeline/persons.conf file seen before. To run this, we can now just type:

docker compose up

And wait for the with-compose-logstash-1 container to exit:

with-compose-logstash-1  | [2023-04-21T15:17:55,335][INFO ][logstash.runner          ] Logstash shut down.
with-compose-logstash-1 exited with code 0

This indicates that our cluster is now up and fully loaded with our sample data set.
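
At this point, a quick way to verify that the documents are searchable is to count them (assuming the same CHANGEME password as before):

curl -s -k -u elastic:CHANGEME https://localhost:9200/person/_count | jq

It should report a count of 1000000.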

Using snapshot and restore

You can also use the Create Snapshot API to back up an existing data set living in Elasticsearch to a shared filesystem or to S3, for example, and then restore it to your new cluster using the Restore API. Let's say you have already registered a repository named sample.
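If you have not, here is a minimal sketch of registering a shared filesystem repository with curl (the location below is a hypothetical path that must be listed under path.repo in elasticsearch.yml):

curl -XPUT -s -k -u elastic:CHANGEME https://localhost:9200/_snapshot/sample -H 'Content-Type: application/json' -d '{"type":"fs","settings":{"location":"/usr/share/elasticsearch/backup"}}'

With the repository in place, you can create the snapshot with: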

# We force merge the segments first
POST /person/_forcemerge?max_num_segments=1
# Snapshot the data
PUT /_snapshot/sample/persons
{
  "indices": "person",
  "include_global_state": false
}

So, any time you start a new cluster, you can just restore the snapshot with:

POST /_snapshot/sample/persons/_restore
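
Note that the restore will fail if an open index named person already exists in the target cluster. In that case, you could delete (or close) it first, for example:

curl -XDELETE -s -k -u elastic:CHANGEME https://localhost:9200/person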

You just need to be careful with this method: a snapshot can only be restored into a cluster it is compatible with, which matters when you upgrade to a new major release. For example, a snapshot created with version 6.3 cannot be restored into an 8.2 cluster. See Snapshot Index Compatibility for more details. But, good news! With Archive Indices, Elasticsearch can now access older snapshot repositories (going back to version 5), with some restrictions to be aware of. To guarantee that your snapshot always stays fully compatible, you might want to snapshot your index again with the most recent version using the same script. The Force Merge API call is important in that case, as it rewrites all the segments using the latest Elasticsearch and Lucene versions.

Using a mounted directory

Remember when we started the cluster?

docker run --name es01 --net elastic -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.7.0

We did not mount the data and config directories. But we can actually do this, using named Docker volumes:

docker run --name es01 --net elastic -p 9200:9200 -it -v persons-data:/usr/share/elasticsearch/data -v persons-config:/usr/share/elasticsearch/config docker.elastic.co/elasticsearch/elasticsearch:8.7.0

We can inspect the Docker volumes that have just been created:

docker volume inspect persons-data persons-config
[
    {
        "CreatedAt": "2023-05-09T10:20:14Z",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/persons-data/_data",
        "Name": "persons-data",
        "Options": null,
        "Scope": "local"
    },
    {
        "CreatedAt": "2023-05-09T10:19:51Z",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/persons-config/_data",
        "Name": "persons-config",
        "Options": null,
        "Scope": "local"
    }
]

You can reuse these volumes later if you want to start your Elasticsearch node using the same command line. If you need to share your volumes with other users, you can back up the data from /var/lib/docker/volumes/persons-config/ and /var/lib/docker/volumes/persons-data/ to /tmp/volume-backup:

docker run --rm -it -v /tmp/volume-backup:/backup -v /var/lib/docker:/docker alpine:edge tar cfz /backup/persons.tgz /docker/volumes/persons-config /docker/volumes/persons-data

Then, you can just share the /tmp/volume-backup/persons.tgz file with the other users and let them restore it:

docker volume create persons-config
docker volume create persons-data
docker run --rm -it -v /tmp/volume-backup:/backup -v /var/lib/docker:/docker alpine:edge tar xfz /backup/persons.tgz -C /

And start the container again:

docker run --name es01 --net elastic -p 9200:9200 -it -v persons-data:/usr/share/elasticsearch/data -v persons-config:/usr/share/elasticsearch/config docker.elastic.co/elasticsearch/elasticsearch:8.7.0

Using Elastic Cloud

Of course, instead of starting and managing a local Elasticsearch instance yourself, you can provision a new Elastic Cloud deployment from a snapshot you took previously. The following code assumes that you have already defined an API key.

POST /api/v1/deployments?validate_only=false
{
  "resources": {
    "elasticsearch": [
      {
        "region": "gcp-europe-west1",
        "plan": {
          "cluster_topology": [
            {
              "zone_count": 2,
              "elasticsearch": {
                "node_attributes": {
                  "data": "hot"
                }
              },
              "instance_configuration_id": "gcp.es.datahot.n2.68x10x45",
              "node_roles": [
                "master",
                "ingest",
                "transform",
                "data_hot",
                "remote_cluster_client",
                "data_content"
              ],
              "id": "hot_content",
              "size": {
                "resource": "memory",
                "value": 8192
              }
            }
          ],
          "elasticsearch": {
            "version": "8.7.1"
          },
          "deployment_template": {
            "id": "gcp-storage-optimized-v5"
          },
          "transient": {
            "restore_snapshot": {
              "snapshot_name": "__latest_success__",
              "source_cluster_id": "CLUSTER_ID"
            }
          }
        },
        "ref_id": "main-elasticsearch"
      }
    ],
    "kibana": [
      {
        "elasticsearch_cluster_ref_id": "main-elasticsearch",
        "region": "gcp-europe-west1",
        "plan": {
          "cluster_topology": [
            {
              "instance_configuration_id": "gcp.kibana.n2.68x32x45",
              "zone_count": 1,
              "size": {
                "resource": "memory",
                "value": 1024
              }
            }
          ],
          "kibana": {
            "version": "8.7.1"
          }
        },
        "ref_id": "main-kibana"
      }
    ]
  },
  "settings": {
    "autoscaling_enabled": false
  },
  "name": "persons",
  "metadata": {
    "system_owned": false
  }
}

Just replace CLUSTER_ID with the ID of the source cluster you took the snapshot from. Once the cluster is up, you have a fully functional instance, available on the internet, with the default data set you want. Once you're done, you can easily shut down the deployment using:

POST /api/v1/deployments/DEPLOYMENT_ID/_shutdown

Here, again, just replace DEPLOYMENT_ID with the deployment ID you saw when you created the deployment.
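
For reference, if you call the Elastic Cloud API with curl instead of the API console, the request could look like this sketch, assuming your API key is stored in the EC_API_KEY environment variable and the deployment body above is saved as deployment.json:

curl -XPOST -s -H "Authorization: ApiKey $EC_API_KEY" -H "Content-Type: application/json" "https://api.elastic-cloud.com/api/v1/deployments?validate_only=false" --data-binary @deployment.json | jq .

The same Authorization header works for the shutdown call.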

Conclusion

As usual with Elastic® and Elasticsearch specifically, you have many ways to achieve your goal. Although there are probably others, I've listed some of them here:

  • Using a shell script: You don't really need any third-party tool, but it requires writing some code. The code looks trivial and is fine when you only run it from time to time. If you need it to be more robust, with retry-on-error behavior for example, you will have to write and maintain even more code.
  • Using Logstash: It's super flexible, as you can also send the data to destinations other than Elasticsearch or apply filters to modify or enrich the source data set. It's a bit slower to start, but that shouldn't be an issue for testing purposes.
  • Using docker compose: One of my favorite ways. You just run docker compose up and voilà! You're done after a few minutes, although loading the data still takes a while and uses hardware resources each time.
  • Using snapshot and restore: Faster than the previous methods because the data is already indexed, but less flexible since the snapshot needs to be compatible with the cluster you are restoring it into. In general, I still prefer injecting the data again, because everything is fresh and I benefit from all the improvements Elasticsearch and Lucene make under the hood.
  • Using a mounted directory: Like a snapshot, but more local. I honestly prefer using the Elastic APIs over mounting an existing directory manually.
  • Using Elastic Cloud: In my opinion, it's the easiest way to share a data set with other people, like customers or internal testers. It's all set, secure, and ready to use with proper SSL certificates.

Depending on your taste and your constraints, you can pick one of these ideas and adapt it to your needs. If you have other ideas to share, please tell us on Twitter or on the discussion forum. A lot of great ideas and features come from the community. Share yours!

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.