08 March 2018 Engineering

A Quick Flight over GDPR: Ten topics from our GDPR & Elasticsearch session at Elastic{ON} 2018

By Mike Paquette

One of my favorite parts of our Elastic{ON} user conference is the Birds of a Feather sessions. There’s something about the community of Elastic users, and their willingness to share stories and experiences, that makes these sessions interesting and fun, *even* if the topic is something relatively dry, such as regulatory compliance!

I had the opportunity to moderate one of these sessions where the topic was GDPR, the European Union’s General Data Protection Regulation, which goes into effect on May 25, 2018.

As compliance discussions go, this one was unusual in the sense that not one person shared an audit story! Of course this makes sense, since GDPR is not yet being enforced, and there’s no history of audit findings, court rulings, or fines levied.

I estimate that there were about 40-50 people in the room. The format was an open discussion, but I had the opportunity to sprinkle in a few questions and jot down some notes. While this is the furthest thing from a scientific poll, I thought I’d share some of the discussion points and anecdotes in the form of a Q&A, with the A’s representing my perception of the sentiment of the group.

Q1: How would you describe your Elastic Stack use case that is affected by GDPR?

A1: While GDPR affects all kinds of personal data, this discussion focused on logging-related use cases, with IT operational analytics and security analytics cited by many participants. Others mentioned application and APM logging. Interestingly, several participants said their GDPR-affected use of the Elastic Stack spanned multiple use cases.

Q2: With respect to GDPR, is your organization a data processor, a data controller, or both?

A2: While a few birds claimed only the data processor role or the data controller role, the vast majority responded that their organization played both data processor and data controller roles.

Q3: GDPR recommends pseudonymization as one method to protect personal data - What kinds of Personal Data are you planning to pseudonymize that might not be obvious?

A3: A few respondents mentioned telephone numbers and user IDs. Many responded with session ID information, such as that coming from application logs, load balancer logs, web server logs, and APM. Others mentioned file paths in access logs (e.g., user/mikepaquette), and even latitude/longitude coordinates and other geoIP information. IP address treatment was a major topic of discussion.

Q4: Again with respect to pseudonymization and other treatment of personal data, what methods are you considering to pseudonymize data?

A4: Respondents discussed a number of techniques, including removing the personal data from logs altogether, masking the data (almost all reported they plan to use this approach), and marking/tagging the data and controlling access to it. Some attendees reported “all of the above,” depending on the data.

Q5: Are you using any automated means to “detect” personal data on an ongoing basis (e.g., checking fields that don’t normally contain personal data)?

A5: An interesting related topic was scanning application error logs that might contain personal data (username/password, transaction details, etc.) before allowing the logs to be seen by analysts. One respondent had built scripts and queries to try to identify common types of personal data that might be present.
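To make the idea of scanning logs concrete, here is a minimal Python sketch of the kind of script a team might write. It is not the respondent’s actual approach, which wasn’t described in detail. The pattern names and regular expressions are illustrative assumptions, and real detection would need broader and more carefully tuned patterns.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a complete
# or authoritative definition of personal data).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "password_kv": re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"),
}


def find_personal_data(line: str) -> list:
    """Return the sorted names of all patterns that match the log line."""
    return sorted(
        name for name, pattern in PATTERNS.items() if pattern.search(line)
    )


if __name__ == "__main__":
    sample = "login failed for bob@example.com password=hunter2 from 10.0.0.5"
    # Lines with any hits could be held back or routed for masking
    # before analysts see them.
    print(find_personal_data(sample))
```

A check like this is best treated as a safety net that flags lines for review, since simple patterns will miss some personal data and flag some harmless values.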

Q6: Do you plan to use encryption at rest on the systems on which the Elastic Stack is installed?

A6: Most folks in the room were unsure whether encryption at rest was necessary on their Elasticsearch cluster systems. Participants said that this issue was closely related to the pseudonymization discussion: if the personal data is already cryptographically pseudonymized, why encrypt the disk too? Others countered that good security controls might include disk-based encryption anyway. A couple of participants presumed that volume-based encryption might be too slow.

Q7: What kind of access controls do you expect to deploy on Elastic Stack systems processing Personal Data?

A7: Most attendees said they planned to use X-Pack security features to implement access controls to Elasticsearch clusters that processed personal data, including X-Pack authentication and role-based access controls (RBAC). An interesting idea came up regarding geo-based access controls for limiting access to data depending on where the user was located. Our product team was there to take that idea back for investigation!

Q8: If you’ve determined that you need to pseudonymize IP addresses in your logs, are you considering or using any special treatment of IP addresses to compensate for analyst inability to use them in detections and investigations (e.g., range calculations, subnet matching)?

A8: This topic prompted lots of discussion. Most approaches involved simply hashing the IP address fields (e.g., using the Logstash fingerprint filter), so that correlation can still happen within the Elastic Stack while the ability to view the actual IP address is limited to certain staff. Many mentioned similar hashing combined with enrichment of events (e.g., geoIP, including ASN) to compensate for the inability to see the actual IP address. Technical approaches ranged from using strong encryption to using “special” pseudonymization functions that preserve subnet and range visibility.
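As a rough illustration of the hashing idea, the Python sketch below replaces an IP address with a keyed hash and adds a second keyed hash of its subnet, so events from the same subnet can still be grouped. This is one simple way to keep some subnet visibility, not the specific functions attendees described, and it is not how the Logstash fingerprint filter is configured. The key, the field names, and the default /24 prefix are assumptions made for the example.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical secret for this sketch; in practice it would live in a
# secrets store and be available only to staff allowed to re-identify data.
SECRET_KEY = b"example-key-change-me"


def _keyed_hash(value: str) -> str:
    """Return the hex HMAC-SHA256 digest of a string under SECRET_KEY."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_ip(ip: str, prefix_len: int = 24) -> dict:
    """Return tokens for an address and for the subnet that contains it.

    The same address always yields the same ip_token, so correlation
    still works. Addresses in the same subnet share a subnet_token.
    """
    addr = ipaddress.ip_address(ip)
    # strict=False lets us derive the network from a host address.
    network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
    return {
        "ip_token": _keyed_hash(str(addr)),
        "subnet_token": _keyed_hash(str(network)),
    }


if __name__ == "__main__":
    a = pseudonymize_ip("10.1.2.3")
    b = pseudonymize_ip("10.1.2.200")
    print(a["subnet_token"] == b["subnet_token"])  # same /24 subnet
    print(a["ip_token"] == b["ip_token"])          # different hosts
```

Using a keyed hash rather than a plain one matters here because the IPv4 address space is small enough that unkeyed hashes can be reversed by simply hashing every possible address.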

Q9: Do you plan to use automated retention period enforcement techniques on your data in the Elastic Stack?

A9: Only one person described how their organization had automated the removal of personal data by automatically deleting indices once the retention period had elapsed. Many seemed to be approaching this with manual processes at first, and would consider automation at a later time, when processes were more mature.
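For anyone considering the automated route, the selection logic can be quite small. The Python sketch below only decides which daily indices are past the retention period; it assumes indices are named with a prefix followed by a YYYY.MM.DD date (such as logstash-2018.03.08), and it leaves the actual deletion to whatever tooling or client you already use. The function name and parameters are hypothetical.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, today, retention_days, prefix="logstash-"):
    """Return the names of daily indices older than the retention period.

    Assumes names look like '<prefix>YYYY.MM.DD'. Names that do not
    follow that convention are skipped rather than guessed at.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            index_day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a dated index, leave it alone
        if index_day < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    names = ["logstash-2018.01.01", "logstash-2018.03.01", ".kibana"]
    # With a 30-day retention period on 8 March 2018, only the January
    # index is old enough to be selected.
    print(expired_indices(names, date(2018, 3, 8), 30))
```

Keeping the selection step separate from the deletion step makes it easy to run in a report-only mode first, which fits the cautious, manual-first approach most attendees described.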

Q10: We know that GDPR recommends securing personal data according to its risk. One could argue that in some cases, the disclosure of an IP address that was present in a log file would be a pretty minor privacy risk. Is your organization considering NOT pseudonymizing IP addresses based on this logic?

A10: Although several folks agreed that the impact to privacy was probably low, it seemed that the main reason for still pseudonymizing IP addresses was worry that their legal team might not agree with them!

So there you have it - a quick flight over some current GDPR topics as shared by our birds of a feather. The 45 minutes seemed to “fly” by!

We’ll be discussing some of these topics and more during our upcoming GDPR webinar. Why not participate to learn more and join in the live Q&A?

In the meantime, please check out our GDPR white paper, in which we provide some GDPR background information and describe a process that organizations can use to prepare for GDPR.