02 August 2018 Engineering

Anonymize-It: The General Purpose Tool for Data Privacy Used by the Elastic Machine Learning Team

By Michael Hirsch

Data science work typically begins with a question like “which of these two UX designs will lead to increased sales on my website?” or “how can I best recommend products to customers given their order history?” Often, one of the biggest challenges in data science is finding appropriate data with which to answer these questions. In many cases, the data you want is locked away behind privacy regulations, incredibly messy, or simply hasn’t been collected yet. One way around this is to find an open dataset that is a close approximation for the ideal dataset required to answer your question. However, these open datasets can be biased, or worse, ethically dubious. The best data often comes straight from the source – the client, customer, or system that owns the data – but we must ensure that this privately owned data is handled responsibly.

On the Machine Learning team, we rely on real-world datasets to develop our unsupervised learning models. Many recent improvements to our modeling (change-point detection, forecasting) relied on having genuine datasets shared by users, customers, and internal teams. In this blog, I will discuss a pseudonymization approach that we’ve developed that aims to solve the problem of data acquisition in the scenario that the data owners are unwilling or unable to share because of privacy concerns.

Anonymization

Data privacy has become an increasingly important topic. It seems like not a week goes by without news of another massive hack or data breach. Further, tales of companies selling personally identifiable information for questionable reasons have become more frequent. At Elastic, we take these concerns very seriously, and have completed a year-long initiative to ensure that our products, services, and company are operating in compliance with the principles of GDPR.

In data anonymization, we are concerned with performing operations on fields which could potentially serve as personal identifiers and quasi-identifiers, whilst maintaining the behavioral characteristics of the data. A personal identifier is an attribute or group of attributes that can uniquely identify an entity, kind of like a primary key. For example, a social security number, a home address, or a passport number. A quasi-identifier (or indirect identifier) is an attribute or group of attributes that are not alone sufficient to uniquely identify an entity, but are sufficiently well correlated with an identity that they can be used to aid in identification. Gender, birthday, and postal code, for example.

Anonymization operations on these identifier fields aim to suppress, mask, or generalize the data. Suppression is simply the exclusion of a field. Masking is the replacement of information with artificial identifiers. This can be done by using masks or mappings of real data values to fake ones, for example mapping “Michael Hirsch” to “Amos Dallas”. Of course, “Amos Dallas” could be a real person’s name, but the masking process disassociates identifiers with the entity, Michael Hirsch. Generalization is the process of grouping attributes into higher-order containers, such as mapping zip codes to zip code prefixes, e.g. mapping the zip code 11105 to 111xx.
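The three operations can be sketched in a few lines of Python. This is a toy illustration with a hand-rolled mapping, not anonymize-it’s implementation; the field names and the mask table are made up for the example:

```python
record = {"name": "Michael Hirsch", "ssn": "078-05-1120", "zip": "11105"}

# Suppression: exclude the field entirely.
suppressed = {k: v for k, v in record.items() if k != "ssn"}

# Masking: replace real values through a mapping to artificial ones.
name_mask = {"Michael Hirsch": "Amos Dallas"}
masked = dict(suppressed, name=name_mask[suppressed["name"]])

# Generalization: group values into higher-order containers,
# here by keeping only the zip code prefix.
generalized = dict(masked, zip=masked["zip"][:3] + "xx")

print(generalized)  # {'name': 'Amos Dallas', 'zip': '111xx'}
```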

The difference between anonymization and pseudonymization is that the former irreversibly destroys identifiable data, while the latter substitutes identifiable information with artificial, potentially reversible values.
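The distinction is easy to see in code. In this toy sketch (again, not anonymize-it itself), pseudonymization keeps a lookup table from real values to artificial ones, so whoever holds the table can reverse the substitution:

```python
forward = {}  # real value -> pseudonym
reverse = {}  # pseudonym -> real value

def pseudonymize(value):
    """Substitute an artificial identifier, remembering the mapping."""
    if value not in forward:
        pseudonym = f"user-{len(forward):04d}"  # artificial stand-in
        forward[value] = pseudonym
        reverse[pseudonym] = value
    return forward[value]

p = pseudonymize("michael@example.com")
print(p)           # user-0000
print(reverse[p])  # michael@example.com -- reversible while the table exists
# True anonymization would discard `forward` and `reverse`,
# making the substitution irreversible.
```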

Let’s take an example, using some sample data from an imaginary customer’s web application.

{
   "request": {
       "remote_address": "123.45.67.89",
       "duration_ms": 919
       },
   "transaction": {
       "name": "GET recommenders.autoEncoder.predict"
       },
   "service": {
       "name": "very-secret-service-name"
   }
}

As part of this hypothetical customer engagement, we would like to have access to the dataset from which this was sampled to assist them in creating machine learning (ML) jobs, and potentially to work on new product development. However, the customer may not be able to share this data, as it contains users’ IP addresses, a clue as to how their recommendation system works, and a top secret service name. The customer would like to de-identify this data but doesn’t necessarily have the time or resources to do it on their own.

anonymize-it

Simply randomizing the dataset will not suffice, since we want to retain the patterns and semantics of the dataset. For the purposes of development, demonstration, and analytics, we would like human-readable strings to map to human-readable strings. So how can we get from the example above to something that the customer is comfortable sharing? For example:

{
   "request": {
       "duration_ms": 919
       },
   "transaction": {
       "name": "GET recommenders.xxx"
       },
   "service": {
       "name": "llama-boot"
   }
}

anonymize-it is a general purpose tool for suppression, masking, and generalization of fields to aid data pseudonymization. It is composed of three parts: readers, anonymizers, and writers. Readers are responsible for gathering data from the source and preparing it for anonymization tasks, anonymizers perform the masking and generalization of field values, and writers write the anonymized data to the destination. Our first release contains a reader for Elasticsearch, an anonymizer based on Faker, and writers that output data to a local filesystem or to Google Cloud Storage.

Readers

Readers are responsible for picking data up from the source and getting distinct values from fields that will need to be anonymized. Readers ignore any data that should be suppressed. Unsurprisingly, the first reader that we have built grabs data from an Elasticsearch cluster; however, we intend to build support for CSVs, Pandas DataFrames, and other potential data sources.

The readers are also responsible for seeding a mapping of unique values. For each unique value, the anonymizer generates artificial data using the provider indicated in the configuration file.
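A rough sketch of that hand-off, using a stand-in provider in place of a real Faker provider (the function names here are illustrative, not anonymize-it’s actual API):

```python
from itertools import count

def build_mapping(values, provider):
    """Map each distinct source value to one artificial value."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = provider()
    return mapping

_counter = count()
fake_slug = lambda: f"slug-{next(_counter)}"  # stand-in for a Faker provider

mapping = build_mapping(["checkout", "search", "checkout", "login"], fake_slug)
print(mapping)  # {'checkout': 'slug-0', 'search': 'slug-1', 'login': 'slug-2'}
```

Because the mapping is keyed on distinct values, repeated occurrences of the same source value always receive the same artificial value, which preserves the patterns in the data.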

Writers

Writers are responsible for writing data to the destination. Simply, they take the output of the anonymizer and write it to a destination of your choice.

Anonymizers

Now, onto the fun part. We’ve seen how suppression is handled; we simply ignore the suppressed fields when reading data. Anonymizers are classes that generate artificial data that matches the semantics of the source data. To do this, we make use of a Python package called Faker. As stated in the project’s README:

Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.

As an example:

>>> from faker import Faker
>>> f = Faker()
>>> for i in range(5):
...     print(f.slug())
...
dinner-recent-lose
beautiful-rock-look
see-control-single
north-actually
large-difference
>>>
>>> for i in range(5):
...     print(f.ipv4())
...
21.159.49.179
15.168.196.167
225.74.18.191
151.175.171.170
154.72.15.51

Faker does not preserve the mapping dictionary during anonymization. However, if you would like to ensure that Faker generates the same values each time, you can seed the data generator with f.seed(919). This also makes Faker incredibly useful for unit testing.

While Faker is quite useful in generating artificial textual data, it does not perform any analysis of the text itself. As an example, it cannot preserve textual prefixes, so it would not be possible to guarantee a mapping of zip codes 11105 and 11415 to two separate strings that share a prefix. This is a form of generalization that anonymize-it aims to provide.
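A minimal sketch of that kind of prefix-preserving generalization — here a simple truncation, not anonymize-it’s actual implementation:

```python
def generalize_zip(zip_code, prefix_len=3):
    """Keep the prefix and blank out the remaining digits."""
    return zip_code[:prefix_len] + "x" * (len(zip_code) - prefix_len)

print(generalize_zip("11105"))  # 111xx
print(generalize_zip("11415"))  # 114xx
# Codes that shared a prefix in the source still share one afterwards,
# which plain Faker providers cannot guarantee.
```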

Also, Faker does not provide any features for anonymizing numerical data. Therefore, currently, anonymize-it cannot map numerical fields in a way that preserves the fields’ distributions or any correlations to other fields.

Anonymize It!

Let’s explore how this works in action.

To run anonymize-it as a script, you must provide a configuration file, for example:

{
 "source": {
   "type": "elasticsearch",
   "params": {
     "host": "host:port",
      "index": "your-index-pattern-*",
     "query": {
       "match": {
         "username": "blaklaybul"
       }
     }
   }
 },
 "dest":{
   "type": "filesystem",
   "params": {
     "directory": "output"
   }
 },
 "include": {
   "service" : "slug",
   "remote_address": "ipv4",
   "@timestamp": null
 },
 "exclude": [],
 "include_rest": false
}

The source object tells anonymize-it how to configure a reader. Each source type has specific optional and required parameters. For an Elasticsearch source, you need to provide a host, an index or index pattern, and an optional query. The username and password are entered via the command line when running the anonymization process.

The dest object tells anonymize-it how to configure a writer. Similar to source, a dest object has a type, and each dest type requires specific parameters. For filesystem, all we need is a directory in which to write the anonymized data.

The rest of the configuration file configures the anonymizer. include instructs anonymize-it which fields should be read from the source, and the Faker providers to use for anonymizing the field’s values. This achieves our masking operation. In this example, we would like to anonymize our service name data using Faker’s slug provider and our remote_address using the ipv4 provider. @timestamp will be included in the output data, but will not be anonymized since null is indicated as the provider. We are currently working on a provider classifier that can determine how to anonymize fields simply by sampling a few values. This will be available in an upcoming release.

Fields indicated in the exclude array will be suppressed from our process and not included in the written output.

Lastly, include_rest is a boolean that indicates whether you would like to include (without masking) all of the other attributes present in the source data. This is for convenience; rather than enumerating every field in the include array, one can simply set include_rest to true.
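Taken together, the three settings could be sketched like this for a single flattened document. This is a hypothetical illustration — select_fields is not anonymize-it’s real API, and internal_id is a made-up extra field:

```python
def select_fields(doc, include, exclude, include_rest):
    """Apply include / exclude / include_rest to one flat document."""
    out = {}
    for field, value in doc.items():
        if field in exclude:
            continue                # suppression: never written out
        if field in include or include_rest:
            out[field] = value      # kept, and possibly masked later
    return out

doc = {"service": "very-secret-service-name",
       "remote_address": "123.45.67.89",
       "@timestamp": "2018-08-02T00:00:00Z",
       "internal_id": 7}

kept = select_fields(doc,
                     include={"service", "remote_address", "@timestamp"},
                     exclude=[], include_rest=False)
print(sorted(kept))  # ['@timestamp', 'remote_address', 'service']
```

With include_rest set to True, internal_id would pass through unmasked as well.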

Once data has been read in from the reader, masks have been generated, and data has been anonymized using the indicated providers, the writer writes data in batches to the destination.

The source code for anonymize-it is available on GitHub. Enjoy!

Disclaimer

anonymize-it is intended to serve as a tool that replaces real data values with realistic artificial ones such that the semantics of the data are retained. It is not intended to satisfy the anonymization requirements of the GDPR, but rather to aid pseudonymization efforts. There may also be some collisions in high-cardinality datasets.