
Data provenance & lineage: technical guidance on the tracing of data - Part 1

1 Introduction

With the proliferation of data sharing, data analytics, and the diffusion of data in decision-making, a crucial concern has settled in the minds of decision makers, technologists, and business analysts: How can you trust the data that you rely on? Answering that question is not straightforward because trust itself is, mathematically speaking, a function: whether and to what degree trust in data is built depends on other variables, such as knowing where data comes from and understanding how it is processed. Answering these questions allows data users to assess, for example, whether the claimed origins of data are genuine, whether data has been tampered with, or whether it is being used in ways that are explainable and legitimate.

In practice, questions about the origins of data and their processing are answered through the tracing of the provenance and lineage of data. For art historians and dealers, provenance is an indispensable, inherent interest to pre-empt art forgery and to allow the restitution of confiscated or stolen art.1 In logistics and food processing, verifying the origins and the lineage of how ingredients are processed is a wide-ranging matter of public health as exemplified by the Europe-wide horse meat scandal in 2013.2

In the area of data analytics, however, the picture is less clear. Tracing the evolution and origins of data is fundamentally different from doing the same for tangible artefacts in the analogue world. Interestingly, in practice, the resolution of data provenance and lineage is usually a single process: tracing how data has evolved (e.g. which computations were performed on it) is inseparable from understanding where data comes from. This is because records of the inputs, entities, systems, and processes executed on the data are crucial both to trace the evolution of data and to understand the origins of the intangible, perfectly replicable asset that is data. The popularity of Distributed Ledger and Blockchain technologies shows the substantive interest of some high-value and high-security domains, and that conceptually viable technologies exist. In other domains, though, the trend is being picked up more slowly, probably correlating with a (perceived) lower importance of data in the value creation process of the organisations that populate the respective sectors.

While data provenance and lineage are well-established concepts, the tracing of data remains a complex, nascent area of practice. One notable characteristic of this domain is that solutions are fragmented, with some well-known applications, such as Public Key Infrastructures (PKI), offering at least partial solutions to the tracing riddle. A second characteristic is that sourcing concepts and analogies from other sectors (or disciplines) is a major source of progress in data provenance and lineage. However, together with rapidly changing user requirements and new technologies, this has also led to an unstructured, sometimes confusing proliferation of ideas and concepts.

Against this backdrop, this report provides an integrated perspective and guidance on realising data traceability in practice. It is meant to be an information primer and guide, aimed at domain experts and decision-makers tasked with defining their organisation’s approach to data traceability. Practitioners should gain a more structured, categorical view of the relevant concepts, challenges, and technologies. To realise this objective, the report provides readers with:

  • an in-depth understanding of data provenance and lineage;
  • a clear sense of common application areas and challenges; and
  • insights into approaches that allow the tracing of data provenance and lineage.

Despite these ambitions, a word of caution is needed: the tracing of data provenance and lineage remains an area where mass-market solutions are mostly unknown. In fact, many issues and potential use cases are not (yet) solvable, although there is active research and development. Many problems might also not be solvable purely from a technology perspective. This report focusses on existing solutions and selected emerging approaches already available to practitioners.

The remainder of this report is structured as follows:

Chapter 2 provides an overview of the essential concepts that are either directly relevant or closely related to the tracing of data. Chapter 3 introduces five practical use cases and solution approaches in which the tracing of data provenance and lineage is particularly important. Chapter 4 structures the different system environments and their implications for the tracing of provenance and lineage. Chapter 5 gives an outlook on five different applications that help to achieve the tracing of data in practice.

 

2 Essential concepts

Data provenance and lineage are closely related to a variety of other concepts. This section explains the most essential concepts, clarifying how these are related to provenance and lineage.

 

2.1 Data provenance, lineage, and traceability

The terms data provenance, lineage and – less commonly – traceability are often used interchangeably3. Depending on the source, their definitions may differ slightly. Unlike data provenance and data lineage, data traceability is not considered to be an established term. The notion of traceability is more common in the areas of logistics, production process design, and requirements engineering, where it essentially describes the ability to trace the application or disposition of certain artefacts (e.g. technical requirements or goods). In the area of data processing, this tracing activity is addressed via data provenance and lineage. Therefore, this report will not use the term traceability. However, we will use the verb to trace, as it describes the activity concerned.

Data provenance/lineage refers to "the process of tracing and recording the origin of data and its movement between databases"4. Lineage tracing can be divided into two types, forward and backward, which describe where tracing starts. Forward tracing begins at the data source, whilst backward tracing begins at the latest version. Thus, tracing can be used either to find the datasets in which a piece of information is present (forward tracing) or to find where a piece of information has its origins (backward tracing). Directed acyclic graphs (DAGs) are often used to trace data origin relationships5. Understanding the provenance of data generated by complex transformations is of particular interest and value. It enables the assessment of data quality based on preceding versions and derivatives, as well as the tracing of errors back to their sources. Additionally, data provenance can be used to break down data sources in data warehouses, to track the creation of intellectual property, and to provide an audit trail.6
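As an illustration of forward and backward tracing over a DAG, the following minimal Python sketch stores "derived from" relationships between datasets and walks them in either direction. The class name `LineageGraph`, its methods, and the dataset names are hypothetical and chosen only for this example; they do not come from any particular lineage tool.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed acyclic graph of 'derived from' relationships (illustrative only)."""

    def __init__(self):
        self.children = defaultdict(set)  # source -> datasets derived from it
        self.parents = defaultdict(set)   # dataset -> sources it was derived from

    def add_derivation(self, source, derived):
        """Record that `derived` was computed from `source`."""
        self.children[source].add(derived)
        self.parents[derived].add(source)

    @staticmethod
    def _reachable(start, edges):
        """Breadth-first walk returning every node reachable from `start`."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in edges.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen

    def trace_forward(self, source):
        """Forward tracing: which datasets contain information from `source`?"""
        return self._reachable(source, self.children)

    def trace_backward(self, dataset):
        """Backward tracing: from which origins was `dataset` derived?"""
        return self._reachable(dataset, self.parents)


# Hypothetical datasets, for illustration.
graph = LineageGraph()
graph.add_derivation("raw_sales", "cleaned_sales")
graph.add_derivation("cleaned_sales", "monthly_report")
graph.add_derivation("exchange_rates", "monthly_report")

forward = graph.trace_forward("raw_sales")
backward = graph.trace_backward("monthly_report")
```

In this sketch, forward tracing from `raw_sales` reaches `cleaned_sales` and `monthly_report`, while backward tracing from `monthly_report` reaches all three upstream datasets. Real systems typically also record the transformation applied on each edge, which this sketch omits.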

In the context of banking, data lineage would refer to keeping track of a deposit for its entire lifetime, i.e. from the moment it is made by a customer until the funds are again withdrawn. This also includes any changes made to the deposit in the meantime, for example partial withdrawals or transfers to other banks.

 

2.2 Data audit and data access audit

Data auditing can take two forms. First, it can be a process in which data is evaluated against various criteria related to a specific purpose. This evaluation covers aspects like the intended data usage, existing data quality, and data curation methodologies. Second, data access audits are processes for logging modifications to a dataset. This can include the tracking of applications and users that accessed data.7 Tracking users can be relevant when reconstructing who has read and/or modified data, at what time, and in what order.
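A data access audit of the kind described above can be sketched as an append-only list of access events that is queried per dataset. This is a minimal illustration under simplifying assumptions (in-memory storage, no protection of the log itself); the names `AccessEvent`, `AccessLog`, and the example users and applications are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    timestamp: datetime
    user: str
    application: str
    dataset: str
    action: str  # "read" or "write"


class AccessLog:
    """In-memory, append-only access log (illustrative only)."""

    def __init__(self):
        self._events = []

    def record(self, user, application, dataset, action, timestamp=None):
        if action not in ("read", "write"):
            raise ValueError(f"unknown action: {action!r}")
        if timestamp is None:
            timestamp = datetime.now(timezone.utc)
        self._events.append(AccessEvent(timestamp, user, application, dataset, action))

    def history(self, dataset):
        """All events for one dataset, ordered by time."""
        events = (e for e in self._events if e.dataset == dataset)
        return sorted(events, key=lambda e: e.timestamp)

    def readers(self, dataset):
        """Every user who has read the dataset."""
        return {e.user for e in self._events
                if e.dataset == dataset and e.action == "read"}


# Hypothetical events with fixed timestamps, recorded out of order on purpose.
log = AccessLog()
log.record("dr_lee", "ehr_portal", "patient_17", "read",
           datetime(2021, 3, 1, 9, 0, tzinfo=timezone.utc))
log.record("billing_bot", "invoicing", "patient_17", "read",
           datetime(2021, 3, 1, 8, 0, tzinfo=timezone.utc))
log.record("dr_lee", "ehr_portal", "patient_17", "write",
           datetime(2021, 3, 1, 9, 5, tzinfo=timezone.utc))

for event in log.history("patient_17"):
    print(event.timestamp.isoformat(), event.user, event.application, event.action)
```

The `history` query answers "who did what, when, and in what order", while `readers` answers "who has seen this record".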

Data audits in the sense of evaluations can be useful in a variety of scenarios. For example, assessments of data quality can provide insights into the utility of data. This can be done automatically via technical means, as is the case in the European Data Portal8. For the tracking of data modifications, Blockchains are probably the most prominent application. Finally, data access audits are common in areas of highly sensitive information. For example, the US Health Insurance Portability and Accountability Act (HIPAA) of 1996 requires health care providers in the United States to be able to provide a list of every party who has read an individual’s health record.
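The core idea that makes Blockchains suitable for tracking modifications, namely that each entry commits to the hash of the previous one, can be illustrated with a simple hash chain. The sketch below is not a Blockchain: it has no distribution, consensus, or signatures, and all function names and example records are hypothetical. It only shows why a retroactive change to a logged modification becomes detectable.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(previous_hash, change):
    """SHA-256 over the previous hash and a canonical JSON form of the change."""
    payload = json.dumps({"prev": previous_hash, "change": change}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_change(chain, change):
    """Append a modification record that commits to the entry before it."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "change": change,
        "prev": previous_hash,
        "hash": _digest(previous_hash, change),
    })


def verify(chain):
    """Recompute every hash; return False if any entry was altered."""
    previous_hash = GENESIS
    for entry in chain:
        if entry["prev"] != previous_hash:
            return False
        if entry["hash"] != _digest(previous_hash, entry["change"]):
            return False
        previous_hash = entry["hash"]
    return True


# Hypothetical modification records.
chain = []
append_change(chain, {"record": 42, "field": "status", "new": "approved"})
append_change(chain, {"record": 42, "field": "status", "new": "paid"})
intact = verify(chain)

# Tamper with the first logged modification after the fact.
chain[0]["change"]["new"] = "rejected"
tampering_detected = not verify(chain)
```

Because each hash depends on all earlier entries, altering an old record invalidates the chain from that point on unless every later hash is recomputed, which is what distributed ledgers are designed to make impractical.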

 

2.3 Digital Rights Management

Tracking the lifecycle of data can also help enforce legitimate usage and prevent digital piracy. For this, Digital Rights Management (DRM), a "group of access control technologies that are used to manage the use of copyrighted materials and to prevent the illegitimate access to digital contents"9, can be used. More precisely, DRM is responsible, among other things, for handling authentication and authorization, licensing and payment, as well as usage control.10 DRM therefore describes the interaction between three entities: users, content, and rights.11 In this model, rights are assigned to a combination of users and content, i.e. they define which users can access which content. DRM policies can be enforced using dedicated frameworks.12
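The three-entity model of users, content, and rights can be sketched as a lookup from (user, content) pairs to a set of rights. The following Python example is a simplified illustration, not the design of any actual DRM framework; the `Right` flags, the `RightsRegistry` class, and the example identifiers are assumptions made for this sketch.

```python
from enum import Flag, auto


class Right(Flag):
    NONE = 0
    PLAY = auto()
    DOWNLOAD = auto()
    SHARE = auto()


class RightsRegistry:
    """Maps (user, content) pairs to the rights granted for that pair."""

    def __init__(self):
        self._grants = {}

    def grant(self, user, content, rights):
        key = (user, content)
        self._grants[key] = self._grants.get(key, Right.NONE) | rights

    def is_allowed(self, user, content, right):
        # Anything not explicitly granted is denied.
        granted = self._grants.get((user, content), Right.NONE)
        return (granted & right) == right


# Hypothetical users and content items.
registry = RightsRegistry()
registry.grant("alice", "film_001", Right.PLAY | Right.DOWNLOAD)
registry.grant("bob", "film_001", Right.PLAY)

print(registry.is_allowed("alice", "film_001", Right.DOWNLOAD))
print(registry.is_allowed("bob", "film_001", Right.DOWNLOAD))
```

A real DRM system would additionally tie such a check to authentication, licensing, and technical enforcement on the playback device, all of which are outside the scope of this sketch.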

Classic uses of DRM can be found in media, for example when online streaming services use technical measures to ensure that their media can only be played on a limited number of devices at a time. Likewise, when online services redistribute digital goods, the DRM policies enforced by the original providers must not be lost during transfer.

 

2.4 Rights over data and data usage

Rights over data are crucial because they define who controls data and who can transfer and access it. Rights over data can refer to two aspects: possession and control of data. This includes the right to transfer access to and allow usage by third parties. Therefore, rights over data define the owner's ability to assign, share, or relinquish such privileges.13 Data usage refers to permissions to access, create and modify data.14

Rights over data are a central concern of the General Data Protection Regulation (GDPR) adopted in 2016 by the European Union. Data usage permissions are typically implemented by file server software: for each file, a set of rules states which users or groups may read and/or write to it.
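Per-file rules of this kind can be sketched as a small access control list. The example below is a hedged illustration rather than the behaviour of any specific file server; the rule format, the `user:` and `group:` prefixes, the group memberships, and the file path are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FileRules:
    readers: set = field(default_factory=set)  # principals allowed to read
    writers: set = field(default_factory=set)  # principals allowed to write


# Hypothetical group memberships and per-file rules.
GROUPS = {"alice": {"finance"}, "bob": {"interns"}}
RULES = {
    "/reports/q1.csv": FileRules(
        readers={"group:finance", "user:bob"},
        writers={"group:finance"},
    ),
}


def principals_of(user):
    """The user plus every group the user belongs to."""
    return {f"user:{user}"} | {f"group:{g}" for g in GROUPS.get(user, set())}


def is_permitted(user, path, action):
    """Check a read or write request against the rules for one file."""
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action!r}")
    rules = RULES.get(path)
    if rules is None:
        return False  # files without rules are denied by default
    allowed = rules.readers if action == "read" else rules.writers
    return bool(principals_of(user) & allowed)


print(is_permitted("alice", "/reports/q1.csv", "write"))
print(is_permitted("bob", "/reports/q1.csv", "write"))
```

Here `alice` may write because she belongs to the `finance` group, whereas `bob` has only been granted read access as an individual user.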

 

2.5 Data governance

Data governance can be defined as “the use of authority combined with policy to ensure the proper management of information assets”15. Data governance usually defines policies, standards, and procedures to ensure data quality as well as to allow compliance monitoring.16 This makes data governance a crucial instrument for data quality management. Data may be a company’s most valuable asset.17 High data quality, ensured through consistent data governance, helps to maintain and maximise the utility of data. Data governance is also vital for conforming to the General Data Protection Regulation (GDPR). It may not be long before a number of aspects of data governance are enforced by laws and regulations, as foreshadowed by the GDPR (see below in 3.3) and by a proposal on introducing a Data Governance Act in the EU.18

Data governance plays a big role in collaborative editing. A prime example is Wikidata19, the knowledge base that supplies structured data to Wikipedia. There are numerous, and ongoing, cases in which individuals and companies have maliciously edited entries for personal gain. Data governance ensures that these changes can be detected and subsequently reverted.

 

Term                        Definition
Data provenance / lineage   Tracing movement (and changes) of data between origin and destination.
Data audit                  Tracing of data modifications.
Data access audit           Tracing of data access.
Digital Rights Management   Mechanisms to prevent illicit access to copyrighted material.
Rights over data            Definition of parties that have permission and control of data.
Data usage                  Permissions to read and/or modify data.
Data governance             Set of policies and rules that ensure a certain level of data quality.

Table 1: Quick Reference of Essential Concepts

 

Data Provenance
Photo credit: Support Centre for Data Sharing
