17 May 2018 News

And the Winner of the Elastic{ON} 2018 Training Subscription Drawing is...

By Dominic Page

The Elastic Training team thanks everybody who came to see us at Elastic{ON} 2018 in San Francisco! Everyone who had their badge scanned at the Elastic Training booth was entered into the drawing for an Online Annual Training Subscription. As promised, we made a random selection to determine the winner after indexing the scanned badge data into Elasticsearch. Here’s a quick breakdown of how we did it.

As with all development, we started with a workflow:

  1. Remove duplicate entries — in case any of you tried to get sneaky and put multiple entries in there ;-)
  2. Filter out Elastic employees (we should know this stuff already, it’s our job).
  3. Redact all Personally Identifiable Information (PII).
  4. Index the list of attendee emails as documents into Elasticsearch.
  5. Pick a lucky winner at random.

To keep this a purely Elasticsearch solution, we indexed the documents using the _bulk API and handled all of the filtering at ingest time. First, we used the ticket ID to reconcile duplicates. Next, we used the email domain to separate out the elastic.co entries, routing them to their own index. Finally, we redacted all PII.
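
Making the ticket ID the document _id (as the pipeline below does) is what powers the deduplication: two documents indexed with the same _id simply overwrite each other. A minimal sketch with a made-up index name and ticket ID:

# two entries sharing a ticket ID collapse into a single document
PUT test_dedupe/doc/ABCDE12345
{ "Email": "first_entry@example.com" }

PUT test_dedupe/doc/ABCDE12345
{ "Email": "second_entry@example.com" }

# returns "count": 1; the second PUT bumped _version rather than adding a document
GET test_dedupe/_count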

Painless disables regular expressions by default (they can be expensive to run at ingest time), so for the domain extraction we first needed to enable regex support in elasticsearch.yml and then restart the node:

# permit regex
script.painless.regex.enabled: true

Then we added an ingest pipeline which used Painless to:

  • Set the document _id to the ticket ID (no PII)
  • Extract the domain from the email address (held in a script variable, so it never lands in the document)
  • Route the document to the employee_ineligible or lead_eligible index based on that domain
  • Redact all potential PII, including the email itself

Like this:

PUT _ingest/pipeline/email_to_id_route_and_redact
{
  "description": "use Ticket_Reference_ID as _id to dedupe, route by domain and redact ALL potential PII",
  "processors": [
    {
      "script": {
        "source": """
// set document id to Ticket_Reference_ID
ctx._id = ctx.Ticket_Reference_ID;
// extract domain from email
def domain = /.*@/.matcher(ctx.Email).replaceAll('');
// if domain == elastic.co, route to ineligible index
ctx._index = (domain == 'elastic.co') ? 
  'employee_ineligible' : 'lead_eligible';
// copy the keys (= document property names) so we can
// safely modify ctx while iterating over them
Set keys = new HashSet(ctx.keySet());
// iterate over the document properties these keys represent
Iterator properties = keys.iterator();
while (properties.hasNext()) {
  def property = properties.next();
  // extract the field prefix
  def prefix = property.substring(0,1);
  // redact fields not generated by ES, excepting Ticket_Reference_ID
  if(!(prefix.equals('_') || 
    prefix.equals('@') || 
    property.equals('Ticket_Reference_ID')))
      // set those fields to pii_redacted  
      ctx[property] = 'pii_redacted';
}
"""
      }
    }
  ]
}
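
One detail worth calling out in that script: the /.*@/ pattern is greedy, so it matches everything up to and including the last @ in the address, and replacing the match with an empty string leaves just the domain. A minimal Painless sketch of the same expression, using one of the test addresses from the simulation below:

// greedy match: everything up to and including the LAST '@' is removed,
// so 'eligible:-)@not_elastic.suffix' becomes 'not_elastic.suffix'
def domain = /.*@/.matcher('eligible:-)@not_elastic.suffix').replaceAll('');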

As a best practice, we tested the pipeline with the _simulate API before running the full ingest, sending two test documents: one ineligible and one eligible.

POST _ingest/pipeline/email_to_id_route_and_redact/_simulate
{
  "docs" : [
    {
      "_index": "index",
      "_type": "doc",
      "_source": {
        "field1": "test",
        "field2": "test",
        "Email": "ineligible:-(@elastic.co",
        "Ticket_Reference_ID": "ABCDE12345"
      }
    },
    {
      "_index": "index",
      "_type": "doc",
      "_source": {
        "field1": "test",
        "field2": "test",
        "Email": "eligible:-)@not_elastic.suffix",
        "Ticket_Reference_ID": "FGHIJ67890"
      }
    }
  ]
}

The output confirmed that the pipeline set the document _id, routed each document to the correct index, and redacted all potential PII.

{
  "docs": [
    {
      "doc": {
        "_index": "employee_ineligible",
        "_type": "doc",
        "_id": "ABCDE12345",
        "_source": {
          "field1": "pii_redacted",
          "Email": "pii_redacted",
          "field2": "pii_redacted",
          "Ticket_Reference_ID": "ABCDE12345"
        },
        "_ingest": {
          "timestamp": "2018-04-19T16:24:13.071Z"
        }
      }
    },
    {
      "doc": {
        "_index": "lead_eligible",
        "_type": "doc",
        "_id": "FGHIJ67890",
        "_source": {
          "field1": "pii_redacted",
          "Email": "pii_redacted",
          "field2": "pii_redacted",
          "Ticket_Reference_ID": "FGHIJ67890"
        },
        "_ingest": {
          "timestamp": "2018-04-19T16:24:13.071Z"
        }
      }
    }
  ]
}

To avoid any inconsistency due to sharding or routing given the small data set, we added an index template to ensure single-shard indices. Note that the *eligible pattern covers both indices, since employee_ineligible happens to end in "eligible" too:

# clean up
DELETE *eligible*
PUT _template/single_shard
{
  "index_patterns" : "*eligible",
  "order" : 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
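
Fetching the template back is a quick way to confirm it registered before any data goes in (a sanity check, not strictly part of the workflow):

GET _template/single_shard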

And then we ingested the data using the _bulk API. The pipeline assigns the real _index, so the index named in the action line is only a placeholder (it does still need to be a valid index name):

PUT _bulk
{ "index" : { "_index" : "lead_eligible", "_type" : "doc", "pipeline": "email_to_id_route_and_redact"} }
{ "Exhibitor_Name":"Elastic", "First_Name":" ...

The next step was to count the eligible and ineligible candidates using a terms aggregation on the _index meta-field:

GET *eligible/_search
{
  "size": 0,
  "aggs" : {
    "eligibility" : {
      "terms" : { "field" : "_index" }
    }
  }
}

The results were:

"eligibility": 
{
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "lead_eligible",
          "doc_count": 393
        },
        {
          "key": "employee_ineligible",
          "doc_count": 9
        }
      ]
}
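
As a sanity check, the two buckets should account for every indexed document; the _count API gives the same total directly:

GET *eligible/_count

which should report "count": 402 (393 eligible plus 9 ineligible).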

Finally, to keep the draw fair and repeatable, we sourced the seed for the random score from a random number generated in the terminal with echo $RANDOM, which returned the value 8121.
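
In a shell session, that looks like:

$ echo $RANDOM
8121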

To pick our lucky winner, we queried the documents using a function_score query with a random_score function. Given the same seed and field, random_score produces the same ordering every time, so the draw is repeatable:

GET lead_eligible/_search
{
  "size": 1,
  "_source": "T*", 
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": 8121,
            "field": "_id"
          }
        }
      ]
    }
  }
}

Which resulted in:

"hits": [
      {
        "_index": "lead_eligible",
        "_type": "doc",
        "_id": "FWNVN5VRB2L",
        "_score": 0.9928309,
        "_source": {
          "Ticket_Type": "pii_redacted",
          "Transcription_Status": "pii_redacted",
          "Title": "pii_redacted",
          "Ticket_Reference_ID": "FWNVN5VRB2L"
        }
      }
    ]

[Image: and_the_winner_is.jpg]

Congratulations FWNVN5VRB2L, aka Stephen Steck! We look forward to seeing you in a virtual classroom soon!

For anyone interested in trying this out on their own, all of the above requests and some sample data can be found in this GitHub Gist.