It's been a little more than two years since I hung up my proverbial data scientist boots, moving from being a practitioner in open data and data sharing to more of a thinker and facilitator in the same space. Since then, I have worked on projects such as the European Data Portal and the Support Centre for Data Sharing (SCDS), both for the European Commission. It is from that original, hands-on experience that I learned how easy it is for data to be not just biased, but simply, unfortunately, wrong.
Many of us like to think that everything can be solved by some piece of technology. Unfortunately, it is not like that. Data collection and processing depend deeply on the human component. Sensors and Internet of Things (IoT) devices are great, but humans decide where to put them, what to measure, the degree of detail we presume is useful, and the format the data is collected into. Digital data transfers are close to perfect... but that makes them perfect at preserving the mistakes and bias we created at the source. And then, at some point in the process, a data scientist will come along and start fiddling with the data, adding further errors, their own interpretation and their personal bias.
Now, imagine how easy it is, through this series of stages and the many hands the data passes through, to maliciously bend the data and its interpretation in a direction you or your organisation want to exploit. For example, we could inject fabricated records into a dataset describing climate change. We could alter the outlook on a medical emergency by changing how we attribute a patient's death to one cause or another. We could even make unemployment statistics look less severe to back the political direction of one party or another.
Of course, this is not completely new. The satirical Italian poet Trilussa described back in the early 1900s how statistics can be manipulated. He argued that if, on average, each of us gets one chicken for dinner, some of us will likely savour two, and someone else none.1 At the time, this was associated with the malice of some politicians.
What is changing these days is twofold. On one side, our understanding of the world and our decisions have never been more grounded in facts – and in the data describing them – than they are today. On the other, there is so much data – and so little effort spent preserving its integrity – that it is possible to change the information at the base of the reasoning itself. We can make two chickens look like three, or four, or 100.
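Trilussa's point can be made concrete in a few lines of Python. The numbers below are invented purely for illustration: two hypothetical groups of four diners share the same total, and the average erases the difference between them.

```python
# Trilussa's chicken: the same average can describe very different realities.
fair = [1, 1, 1, 1]    # everyone gets one chicken for dinner
skewed = [2, 2, 0, 0]  # some savour two, someone else gets none

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

print(mean(fair))    # 1.0
print(mean(skewed))  # 1.0 -- identical average, very different dinners
```

The average alone cannot tell the two situations apart; only the underlying distribution can, which is exactly why tampering with (or hiding) the raw data is so effective.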
Over the last few years, the issue of fake news has come to most of our attention. Occasionally, we are amused by how skilful the makers of a deepfake video are. But we never talk about fake data. It is time we prepared to deal with more of it, more often. SCDS is doing its part to help. Later this year or in early 2021, we will publish our report on the available technology to document the traceability of data through its provenance and lineage: respectively, where it comes from and what transformations it went through before falling into your lap. But be aware: technological wonders cannot replace our commitment to trustworthy data, information and the interpretation thereof.
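To picture what documenting provenance and lineage means in practice, here is a minimal, hypothetical sketch: a record stating where a dataset came from and a log of every transformation applied to it. The field names, URL and processing steps are all invented for illustration, not drawn from any real standard or from the forthcoming report.

```python
from datetime import datetime, timezone

# Hypothetical, minimal lineage record: provenance (where the data comes
# from) plus lineage (what transformations it went through).
record = {
    "source": "https://example.org/unemployment-2020.csv",  # invented URL
    "retrieved_at": datetime(2020, 11, 5, tzinfo=timezone.utc).isoformat(),
    "transformations": [],
}

def log_step(rec, description):
    """Append one processing step to the dataset's lineage trail."""
    rec["transformations"].append(description)

log_step(record, "dropped rows with missing region codes")
log_step(record, "aggregated monthly figures to quarterly averages")

print(record["transformations"])
```

Even a trail this simple makes fiddling harder to hide: any figure in the final dataset can be traced back through the listed steps to its source.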
I was discussing this on stage a few days ago, presenting at the United Nations' "AI for Good" conference2. Many in government and industry worry about an artificial intelligence (AI) "arms race", but it sounds silly when the data we feed to those AIs – to train them, or to let them make decisions for us – is so bad (if it is even real data) in the first place. It is like worrying about designing bigger and faster sportscars during an oil crisis. Keep that in mind, and make the fight for better data your first fight, too.