It happens every time. I'm there, trying to have a constructive discussion about the challenges businesses or governments face when sharing data, when someone plays the "GDPR card". You can't really share any data - they say - because there will always be some degree of personal information in it, and even when there is not, the data can be combined with other sources and become revealing. I understand where the objection comes from - I myself helped demonstrate the fragility of bad anonymisation while I was still a practising data scientist - but that is not a good reason to give up on putting data to good use, or on sharing it with someone who could.
At the kind of geeky social situations I attend, the General Data Protection Regulation (GDPR), which has applied in the European Union since May 2018, is the most effective of party breakers. We all agree: data literacy is at terrible levels, and it is so easy for people who have access to sensitive data - even with the best intentions - to make a mistake or be careless and compromise people's privacy. Even when that is not the case, computer and network security will fail at some point, and malicious actors will access the systems and steal the data. What's GDPR's solution to that specific problem? Minimisation: in short, the principle by which you should delete any data that is not strictly "adequate, relevant and limited to what is necessary in relation to the [original] purposes for which they are processed" (Article 5(1)(c)).
As much as I love the GDPR and believe that we owe the EU so much of the vocabulary and conversation we have around privacy these days, it is also my opinion that minimisation is a solution to personal data protection in the same way a flamethrower is a solution to starting a bbq fire. It's a knee-jerk reaction, a surrender to the idea that data processors are simply unethical and can't be trusted to behave or to implement proper computer security. It's like saying that the only way to get someone with access to data to respect people's privacy is to force them to delete it in the first place. I would have loved to be a fly on the wall in the rooms of the Berlaymont building where Article 5 was written, to observe the dynamics of the conversations that made minimisation look like the only feasible option.
Malcolm Gladwell, in "The Basement Tapes" episode of his Revisionist History podcast [1], tells the amazing story of Ivan Frantz Jr, a doctor and medical researcher who ran the "National Diet Heart Study", a massive controlled clinical trial, over five years in the late 1960s and early 1970s. The data collected through his work showed the negative effects of vegetable oils rich in polyunsaturated fats (corn, sunflower, margarine and others) that - in Frantz's time - were arbitrarily believed to be healthier than animal fat, then more common in US cooking. The fact is, the discovery was not made by Dr Frantz but by Christopher Ramsden, a researcher at the US National Institutes of Health, almost a quarter of a century later. What made his discovery possible? The data Dr Frantz had abandoned in the basement of his old house, which his son Robert helped Ramsden recover.
Call me naïve, but I see the development of data literacy and skills as a better solution than minimisation. We need to help people who work with data understand the responsibility and risk that come with their duties, in the same way that people need to pass an exam and obtain a licence before they can drive a car. True, getting people to develop skills and an ethical sense for data takes more time and is more expensive than pressing the "Delete" button, but what potential value are we losing by deleting? Modern statistics, which were not available to Dr Frantz, revealed the precious truth hidden in otherwise inconclusive personal data from the '60s. What will AI reveal tomorrow from the data we are destroying today?
For reference, a transcript of the podcast episode is available at: https://blog.simonsays.ai/the-basement-tapes-with-malcolm-gladwell-s2-e10-revisionist-history-podcast-transcript-d764d0472079?gi=48083c9d8c7c
- 1. See: http://revisionisthistory.com/episodes/20-the-basement-tapes