Data provenance & lineage: technical guidance on the tracing of data - Part 1
In the first part of the guidance on data provenance and lineage, we introduce the topic and explore its essential concepts.
With the proliferation of data sharing and data analytics, and the growing role of data in decision-making, a crucial concern has settled in the minds of decision makers, technologists, and analysts: how can you trust the data that you rely on? Trust in data depends on many factors, such as knowing where data comes from and understanding how it is processed. Establishing these facts allows data users to assess, for example, whether the claimed origins of data are genuine, whether data has been tampered with, and whether it is being used in ways that are explainable and legitimate. Data provenance and lineage address many of these issues by helping to keep data traceable.
This report provides an integrated perspective and guidance on how to ensure the traceability of data in practice. It is meant to be an information primer and guide, aimed at domain experts and decision-makers tasked with defining their organisation’s approach to data traceability. Practitioners should gain a more structured, categorical view of the relevant concepts, challenges, and technologies. To realise this objective, the report provides readers with:
In the first part of the guidance on data provenance and lineage, we introduce the topic and explore its essential concepts.
The second part of the guidance on data provenance and lineage describes typical scenarios for tracing data and explains how these relate to different system settings.
The final part of the guidance on data provenance and lineage discusses important applications that help ensure data provenance and lineage in different settings.