And the Winner of the Elastic{ON} 2018 Training Subscription Drawing is...
The Elastic Training team thanks everybody who came to see us at Elastic{ON} 2018 in San Francisco! Everyone who had their badge scanned at the Elastic Training booth was entered into the drawing for an Online Annual Training Subscription. As promised, we made a random selection to determine the winner after indexing the scanned badge data into Elasticsearch. Here’s a quick breakdown of how we did it.
As with all development, we started with a workflow:
- Remove duplicate entries — in case any of you tried to get sneaky and put multiple entries in there ;-)
- Filter out Elastic employees (we should know this stuff already, it’s our job).
- Redact all Personally Identifiable Information (PII).
- Index the list of attendee emails as documents into Elasticsearch.
- Pick a lucky winner at random.
To get started, we indexed the documents using the _bulk API, making this a purely Elasticsearch solution. For the filtering, we first used the ticket ID to reconcile duplicates, then used the email domain to filter out elastic.co entries, routing those to a separate index. Finally, we redacted all PII.
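Setting the document _id to the ticket ID is what handles the deduplication: indexing a document with an existing _id overwrites the previous version instead of creating a duplicate. A minimal sketch, reusing one of the ticket IDs from the test data later in this post:

PUT lead_eligible/doc/ABCDE12345
{ "Ticket_Reference_ID": "ABCDE12345" }

# repeating the exact same request returns "_version": 2,
# i.e. the entry is overwritten, not duplicated
PUT lead_eligible/doc/ABCDE12345
{ "Ticket_Reference_ID": "ABCDE12345" }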
For the domain extraction, we first needed to enable regex in elasticsearch.yml and then restart the node:

# permit regex
script.painless.regex.enabled: true
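After the restart, one way to confirm the setting took effect is to pull back the node settings; the filter_path shown here is just an illustrative way to trim the response down to the script settings:

GET _nodes/settings?filter_path=nodes.*.settings.script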
Then we added an ingest pipeline which used Painless to:
- Set the document _id to the ticket ID (no PII)
- Extract the domain from the email address as a new field
- Route to index employee_ineligible or lead_eligible on the basis of domain
- Redact all potential PII (including the domain)
Like this:
PUT _ingest/pipeline/email_to_id_route_and_redact
{
  "description": "use Ticket_Reference_ID as _id to dedupe, route by domain and redact ALL potential PII",
  "processors": [
    {
      "script": {
        "source": """
          // set document id to Ticket_Reference_ID
          ctx._id = ctx.Ticket_Reference_ID;
          // extract domain from email
          def domain = /.*@/.matcher(ctx.Email).replaceAll('');
          // if domain == elastic.co, route to ineligible index
          ctx._index = (domain == 'elastic.co') ? 'employee_ineligible' : 'lead_eligible';
          // get the keys (= document property names)
          Set keys = new HashSet(ctx.keySet());
          // iterate over the document properties these keys represent
          Iterator properties = keys.iterator();
          while (properties.hasNext()) {
            def property = properties.next();
            // extract the field prefix
            def prefix = property.substring(0, 1);
            // redact fields not generated by ES, excepting Ticket_Reference_ID
            if (!(prefix.equals('_') || prefix.equals('@') || property.equals('Ticket_Reference_ID'))) {
              // set those fields to pii_redacted
              ctx[property] = 'pii_redacted';
            }
          }
        """
      }
    }
  ]
}
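The one line worth a closer look is the domain extraction: /.*@/.matcher(ctx.Email).replaceAll('') deletes everything up to and including the @, leaving only the domain. On clusters where the Painless execute API is available (Elasticsearch 6.3 and later), you can try the expression in isolation with a made-up address:

POST _scripts/painless/_execute
{
  "script": {
    "source": "return /.*@/.matcher(params.email).replaceAll('');",
    "params": {
      "email": "someone@example.com"
    }
  }
}

# returns { "result": "example.com" }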
As a best practice, we used the _simulate API to test our pipeline before running the full ingest, sending two test documents: one eligible and one ineligible.
POST _ingest/pipeline/email_to_id_route_and_redact/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_type": "doc",
      "_source": {
        "field1": "test",
        "field2": "test",
        "Email": "ineligible:-(@elastic.co",
        "Ticket_Reference_ID": "ABCDE12345"
      }
    },
    {
      "_index": "index",
      "_type": "doc",
      "_source": {
        "field1": "test",
        "field2": "test",
        "Email": "eligible:-)@not_elastic.suffix",
        "Ticket_Reference_ID": "FGHIJ67890"
      }
    }
  ]
}
The output confirmed that the pipeline set the document _id, routed to the correct indices, and redacted all potential PII.
{ "docs": [ { "doc": { "_index": "employee_ineligible", "_type": "doc", "_id": "ABCDE12345", "_source": { "field1": "pii_redacted", "Email": "pii_redacted", "field2": "pii_redacted", "Ticket_Reference_ID": "ABCDE12345" }, "_ingest": { "timestamp": "2018-04-19T16:24:13.071Z" } } }, { "doc": { "_index": "lead_eligible", "_type": "doc", "_id": "FGHIJ67890", "_source": { "field1": "pii_redacted", "Email": "pii_redacted", "field2": "pii_redacted", "Ticket_Reference_ID": "FGHIJ67890" }, "_ingest": { "timestamp": "2018-04-19T16:24:13.071Z" } } } ] }
To avoid any inconsistency due to sharding or routing given the small data set, we added an index template to ensure single-shard indices:
# clean up
DELETE *eligible*

PUT _template/single_shard
{
  "index_patterns": "*eligible",
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
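To double-check the template before ingesting, you can simply fetch it back and confirm the index pattern and settings:

GET _template/single_shard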
And then ingested the data using the _bulk API:
PUT _bulk
{ "index" : { "_index" : "", "_type" : "doc", "pipeline": "email_to_id_route_and_redact" } }
{ "Exhibitor_Name": "Elastic", "First_Name": " ...
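Once the bulk request completes, the _cat indices API offers a quick sanity check that the pipeline routing created both indices, each with a single shard:

GET _cat/indices/*eligible?v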
The next step was to count the eligible and ineligible candidates using a terms aggregation on the _index field:
GET *eligible/_search
{
  "size": 0,
  "aggs": {
    "eligibility": {
      "terms": {
        "field": "_index"
      }
    }
  }
}
The results were:
"eligibility": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "lead_eligible", "doc_count": 393 }, { "key": "employee_ineligible", "doc_count": 9 } ] }
Finally, to keep the selection fair and repeatable, we sourced the seed for the random score from the terminal with echo $RANDOM, which returned the value 8121.
To pick our lucky winner, we then ran a function_score query with a random_score function using that seed:
GET lead_eligible/_search
{
  "size": 1,
  "_source": "T*",
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": 8121,
            "field": "_id"
          }
        }
      ]
    }
  }
}
Which resulted in:
"hits": [ { "_index": "lead_eligible", "_type": "doc", "_id": "FWNVN5VRB2L", "_score": 0.9928309, "_source": { "Ticket_Type": "pii_redacted", "Transcription_Status": "pii_redacted", "Title": "pii_redacted", "Ticket_Reference_ID": "FWNVN5VRB2L" } } ]
Congratulations FWNVN5VRB2L, aka Stephen Steck! We look forward to seeing you in a virtual classroom soon!
For anyone interested in trying this out on their own, all of the above requests and some sample data can be found in this GitHub Gist.