Loading Sample Dataedit

The tutorials in this section rely on the following data sets:

  • The complete works of William Shakespeare, suitably parsed into fields. Download this data set by clicking here: shakespeare.json.
  • A set of fictitious accounts with randomly generated data. Download this data set by clicking here: accounts.zip
  • A set of randomly generated log files. Download this data set by clicking here: logs.jsonl.gz

Two of the data sets are compressed. Use the following commands to extract the files:

unzip accounts.zip
gunzip logs.jsonl.gz

The Shakespeare data set is organized in the following schema:

{
    "line_id": INT,
    "play_name": "String",
    "speech_number": INT,
    "line_number": "String",
    "speaker": "String",
    "text_entry": "String",
}

The accounts data set is organized in the following schema:

{
    "account_number": INT,
    "balance": INT,
    "firstname": "String",
    "lastname": "String",
    "age": INT,
    "gender": "M or F",
    "address": "String",
    "employer": "String",
    "email": "String",
    "city": "String",
    "state": "String"
}

The schema for the logs data set has dozens of different fields, but the notable ones used in this tutorial are:

{
    "memory": INT,
    "geo.coordinates": "geo_point"
    "@timestamp": "date"
}

Before we load the Shakespeare and logs data sets, we need to set up mappings for the fields. Mapping divides the documents in the index into logical groups and specifies a field’s characteristics, such as the field’s searchability or whether or not it’s tokenized, or broken up into separate words.

Use the following command in a terminal (eg bash) to set up a mapping for the Shakespeare data set:

curl -XPUT http://localhost:9200/shakespeare -d '
{
 "mappings" : {
  "_default_" : {
   "properties" : {
    "speaker" : {"type": "string", "index" : "not_analyzed" },
    "play_name" : {"type": "string", "index" : "not_analyzed" },
    "line_id" : { "type" : "integer" },
    "speech_number" : { "type" : "integer" }
   }
  }
 }
}
';

This mapping specifies the following qualities for the data set:

  • The speaker field is a string that isn’t analyzed. The string in this field is treated as a single unit, even if there are multiple words in the field.
  • The same applies to the play_name field.
  • The line_id and speech_number fields are integers.

The logs data set requires a mapping to label the latitude/longitude pairs in the logs as geographic locations by applying the geo_point type to those fields.

Use the following commands to establish geo_point mapping for the logs:

curl -XPUT http://localhost:9200/logstash-2015.05.18 -d '
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
';
curl -XPUT http://localhost:9200/logstash-2015.05.19 -d '
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
';
curl -XPUT http://localhost:9200/logstash-2015.05.20 -d '
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
';

The accounts data set doesn’t require any mappings, so at this point we’re ready to use the Elasticsearch bulk API to load the data sets with the following commands:

curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
curl -XPOST 'localhost:9200/shakespeare/_bulk?pretty' --data-binary @shakespeare.json
curl -XPOST 'localhost:9200/_bulk?pretty' --data-binary @logs.jsonl

These commands may take some time to execute, depending on the computing resources available.

Verify successful loading with the following command:

curl 'localhost:9200/_cat/indices?v'

You should see output similar to the following:

health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank                  5   1       1000            0    418.2kb        418.2kb
yellow open   shakespeare           5   1     111396            0     17.6mb         17.6mb
yellow open   logstash-2015.05.18   5   1       4631            0     15.6mb         15.6mb
yellow open   logstash-2015.05.19   5   1       4624            0     15.7mb         15.7mb
yellow open   logstash-2015.05.20   5   1       4750            0     16.4mb         16.4mb