17 May 2018 News

And the Winner of the Elastic{ON} 2018 Training Subscription Drawing is...

By Dominic Page

The Elastic Training team thanks everybody who came to see us at Elastic{ON} 2018 in San Francisco! Everyone who had their badge scanned at the Elastic Training booth was entered into the drawing for an Online Annual Training Subscription. As promised, we made a random selection to determine the winner after indexing the scanned badge data into Elasticsearch. Here’s a quick breakdown of how we did it.

As with all development, we started with a workflow:

  1. Remove duplicate entries — in case any of you tried to get sneaky and put multiple entries in there ;-)
  2. Filter out Elastic employees (we should know this stuff already, it’s our job).
  3. Redact all Personally Identifiable Information (PII).
  4. Index the list of attendee emails as documents into Elasticsearch.
  5. Pick a lucky winner at random.

To keep this a purely Elasticsearch solution, we indexed the documents using the _bulk API and handled all of the filtering at ingest time. First, we used the ticket ID to reconcile duplicates. Next, we used the email domain to separate out the elastic.co entries, routing them to their own index. Finally, we redacted all PII.
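
Making the ticket ID the document _id (as the pipeline below does) is what powers the deduplication: two documents indexed with the same _id simply overwrite each other. A minimal sketch with a made-up index name and ticket ID:

# two entries sharing a ticket ID collapse into a single document
PUT test_dedupe/doc/ABCDE12345
{ "Email": "first_entry@example.com" }

PUT test_dedupe/doc/ABCDE12345
{ "Email": "second_entry@example.com" }

# returns "count": 1; the second PUT bumped _version rather than adding a document
GET test_dedupe/_count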

Painless disables regular expressions by default (they can be expensive to run at ingest time), so for the domain extraction we first needed to enable regex support in elasticsearch.yml and then restart the node:

# permit regex
script.painless.regex.enabled: true

Then we added an ingest pipeline which used Painless to:

  • Set the document _id to the ticket ID (no PII)
  • Extract the domain from the email address (held in a script variable, so it never lands in the document)
  • Route the document to the employee_ineligible or lead_eligible index based on that domain
  • Redact all potential PII, including the email itself

Like this:

PUT _ingest/pipeline/email_to_id_route_and_redact
{
  "description": "use Ticket_Reference_ID as _id to dedupe, route by domain and redact ALL potential PII",
  "processors": [
    {
      "script": {
        "source": """
// set document id to Ticket_Reference_ID
ctx._id = ctx.Ticket_Reference_ID;
// extract domain from email
def domain = /.*@/.matcher(ctx.Email).replaceAll('');
// if domain == elastic.co, route to ineligible index
ctx._index = (domain == 'elastic.co') ? 
  'employee_ineligible' : 'lead_eligible';
// copy the keys (= document property names) so we can
// safely modify ctx while iterating over them
Set keys = new HashSet(ctx.keySet());
// iterate over the document properties these keys represent
Iterator properties = keys.iterator();
while (properties.hasNext()) {
  def property = properties.next();
  // extract the field prefix
  def prefix = property.substring(0,1);
  // redact fields not generated by ES, excepting Ticket_Reference_ID
  if(!(prefix.equals('_') || 
    prefix.equals('@') || 
    property.equals('Ticket_Reference_ID')))
      // set those fields to pii_redacted  
      ctx[property] = 'pii_redacted';
}
"""
      }
    }
  ]
}
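
One detail worth calling out in that script: the /.*@/ pattern is greedy, so it matches everything up to and including the last @ in the address, and replacing the match with an empty string leaves just the domain. A minimal Painless sketch of the same expression, using one of the test addresses from the simulation below:

// greedy match: everything up to and including the LAST '@' is removed,
// so 'eligible:-)@not_elastic.suffix' becomes 'not_elastic.suffix'
def domain = /.*@/.matcher('eligible:-)@not_elastic.suffix').replaceAll('');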

As a best practice, we tested the pipeline with the _simulate API before running the full ingest, sending two test documents: one ineligible and one eligible.

POST _ingest/pipeline/email_to_id_route_and_redact/_simulate
{
  "docs" : [
    {
      "_index": "index",
      "_type": "doc",
      "_source": {
        "field1": "test",
        "field2": "test",
        "Email": "ineligible:-(@elastic.co",
        "Ticket_Reference_ID": "ABCDE12345"
      }
    },
    {
      "_index": "index",
      "_type": "doc",
      "_source": {
        "field1": "test",
        "field2": "test",
        "Email": "eligible:-)@not_elastic.suffix",
        "Ticket_Reference_ID": "FGHIJ67890"
      }
    }
  ]
}

The output confirmed that the pipeline set the document _id, routed each document to the correct index, and redacted all potential PII.

{
  "docs": [
    {
      "doc": {
        "_index": "employee_ineligible",
        "_type": "doc",
        "_id": "ABCDE12345",
        "_source": {
          "field1": "pii_redacted",
          "Email": "pii_redacted",
          "field2": "pii_redacted",
          "Ticket_Reference_ID": "ABCDE12345"
        },
        "_ingest": {
          "timestamp": "2018-04-19T16:24:13.071Z"
        }
      }
    },
    {
      "doc": {
        "_index": "lead_eligible",
        "_type": "doc",
        "_id": "FGHIJ67890",
        "_source": {
          "field1": "pii_redacted",
          "Email": "pii_redacted",
          "field2": "pii_redacted",
          "Ticket_Reference_ID": "FGHIJ67890"
        },
        "_ingest": {
          "timestamp": "2018-04-19T16:24:13.071Z"
        }
      }
    }
  ]
}

To avoid any inconsistency due to sharding or routing given the small data set, we added an index template to ensure single-shard indices. Note that the *eligible pattern covers both indices, since employee_ineligible happens to end in "eligible" too:

# clean up
DELETE *eligible*
PUT _template/single_shard
{
  "index_patterns" : "*eligible",
  "order" : 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
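
Fetching the template back is a quick way to confirm it registered before any data goes in (a sanity check, not strictly part of the workflow):

GET _template/single_shard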

And then we ingested the data using the _bulk API. The pipeline assigns the real _index, so the index named in the action line is only a placeholder (it does still need to be a valid index name):

PUT _bulk
{ "index" : { "_index" : "lead_eligible", "_type" : "doc", "pipeline": "email_to_id_route_and_redact"} }
{ "Exhibitor_Name":"Elastic", "First_Name":" ...

The next step was to count the eligible and ineligible candidates using a terms aggregation on the _index meta-field:

GET *eligible/_search
{
  "size": 0,
  "aggs" : {
    "eligibility" : {
      "terms" : { "field" : "_index" }
    }
  }
}

The results were:

"eligibility": 
{
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "lead_eligible",
          "doc_count": 393
        },
        {
          "key": "employee_ineligible",
          "doc_count": 9
        }
      ]
}
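
As a sanity check, the two buckets should account for every indexed document; the _count API gives the same total directly:

GET *eligible/_count

which should report "count": 402 (393 eligible plus 9 ineligible).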

Finally, to keep the draw fair and repeatable, we sourced the seed for the random score from a random number generated in the terminal with echo $RANDOM, which returned the value 8121.
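
In a shell session, that looks like:

$ echo $RANDOM
8121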

To pick our lucky winner, we queried the documents using a function_score query with a random_score function. Given the same seed and field, random_score produces the same ordering every time, so the draw is repeatable:

GET lead_eligible/_search
{
  "size": 1,
  "_source": "T*", 
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": 8121,
            "field": "_id"
          }
        }
      ]
    }
  }
}

Which resulted in:

"hits": [
      {
        "_index": "lead_eligible",
        "_type": "doc",
        "_id": "FWNVN5VRB2L",
        "_score": 0.9928309,
        "_source": {
          "Ticket_Type": "pii_redacted",
          "Transcription_Status": "pii_redacted",
          "Title": "pii_redacted",
          "Ticket_Reference_ID": "FWNVN5VRB2L"
        }
      }
    ]

[Image: and_the_winner_is.jpg]

Congratulations FWNVN5VRB2L, aka Stephen Steck! We look forward to seeing you in a virtual classroom soon!

For anyone interested in trying this out on their own, all of the above requests and some sample data can be found in this GitHub Gist.