Data provenance & lineage: technical guidance on the tracing of data - Part 3

<< Read part 2 of the technical guidance on data provenance and lineage

5 Applications for the tracing of data

The tracing of data can be enabled via a number of different applications and approaches. This section presents five emerging and established solutions to the challenge of recording data provenance and lineage.

 

5.1 Workflow engines

The Workflow Management Coalition (WFMC) defines workflow as a “sequence of steps involved in moving from the beginning to the end of a working process”1. This rather generic definition also applies to data processing, a process in which pieces of information are moved through a system for analysis and/or manipulation.

Consequently, a workflow management system “defines, creates and manages the execution of workflows through the use of software, [..] interacts with workflow participants and, where required, invokes the use of IT tools and applications”2. In such a system, the term workflow engine describes software designed for executing customizable, consecutive data transformation steps. The WFMC defines a workflow engine as a “software service [..] that provides the run time execution environment for a process instance”3. As such, workflow engines are responsible for executing a workflow’s individual steps and are thus a core component of workflow management systems (WFMS).

As in any scenario involving data manipulation by multiple parties, workflow engines can also incorporate data provenance measures. In workflow engines, the term traceability can be applied not only to the data that is being processed, but also to the order in which the individual processes are invoked4. The seemingly natural way of implementing traceability, i.e. using logs, is agnostic to the underlying technology and can thus also be used in workflow engines.

However, depending on the nature and amount of data that is being processed, as well as the number of steps involved, the amount of provenance information that is being generated can quickly become overwhelming. Moreover, different users may be interested in different aspects of provenance. These two issues can be tackled by so-called “user views”, in which each user can individually select the provenance information relevant to them. Such functionalities are offered by provenance frameworks5.

Workflow systems are special in that they satisfy the use case of running a given set of computations multiple times, with only slight variations. This makes sophisticated provenance tracking, which incorporates relationships between inputs, intermediate results, and outputs, a vital feature6. The key challenge here is to determine how results have been derived, which should enable users to recreate the workflow that produced said results.
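The relationship between inputs, intermediate results and outputs can be illustrated with a minimal sketch. It assumes a hypothetical in-memory engine; the names `ProvenanceLog`, `run_step` and `lineage` are invented for illustration and are not taken from Kepler, Taverna or any other workflow system. Values are identified by a fingerprint of their content, so two unrelated steps that happen to produce the same value would be conflated. A real engine would assign identifiers to intermediate results instead.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(value) -> str:
    """Return a short, stable fingerprint of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class ProvenanceLog:
    """Append-only record of which step turned which inputs into which output."""
    records: list = field(default_factory=list)

    def run_step(self, name, func, *inputs):
        """Execute one workflow step and log input and output fingerprints."""
        output = func(*inputs)
        self.records.append({
            "step": name,
            "inputs": [fingerprint(i) for i in inputs],
            "output": fingerprint(output),
        })
        return output

    def lineage(self, value):
        """List the steps that contributed to `value`, most recent first."""
        wanted = {fingerprint(value)}
        steps = []
        for record in reversed(self.records):
            if record["output"] in wanted:
                steps.append(record["step"])
                wanted.update(record["inputs"])
        return steps


log = ProvenanceLog()
raw = [3, 1, 2, 2]
deduplicated = log.run_step("deduplicate", lambda xs: sorted(set(xs)), raw)
total = log.run_step("sum", sum, deduplicated)
unrelated = log.run_step("count", len, [9, 9])

print(log.lineage(total))  # the "count" step does not appear in this lineage
```

Replaying the listed steps in reverse order on the original input would recreate the result, which is the property the paragraph above asks for.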

Multiple frameworks and concepts have been developed for implementing sophisticated traceability7. The open source workflow engines Kepler Project8 and Apache Taverna9 both ship with modules providing provenance functionality. Additionally, vendor-agnostic models for storing provenance information, such as that generated by workflow systems, have been proposed, for example Open Provenance10.

 

5.2 Identifiers for data

As discussed in section 3.1, persistent dataset identifiers can be a desirable solution to ensure that the provenance of data can be tracked. A persistent identifier is typically a unique sequence of characters, i.e. a referenceable ID, assigned to each dataset. One concept for ID generation is the Digital Object Identifier (DOI)11, specified by ISO 26324. It was developed and is administered by the International DOI Foundation (IDF)12. Instead of being a random string of characters, DOIs follow the schema 10.XXXX/ID. The placeholder XXXX stands for the identifier of the publisher, which is at least four digits long. The DOI was originally intended for online articles of scientific journals, and the inclusion of the publisher provides a reference to a document’s origins. Unlike ISBNs, which end in a check digit, DOIs do not feature measures to ensure integrity.

However, the DOI system is more than a schema for generating identifiers. It also features several mechanisms that support practical work with DOIs. One key component is the resolver. It is backed by the so-called Handle System, which allows DOIs to be referenced persistently, even if the location of the underlying resource changes. This sets it apart from Uniform Resource Locators (URLs), which can include information about specific protocols and parameters required for accessing a resource. The eponymous “handle” is responsible for connecting resource-agnostic DOIs with the respective resolver. This approach makes central registration agencies mandatory, which also prevents collisions, i.e. identical DOIs being assigned to two different resources. The IDF oversees these agencies, one of which is the Publications Office of the European Union13. Each DOI is also a Uniform Resource Identifier (URI) resolvable via dedicated services, for example the one provided by the IDF. Most of these services allow resolving DOIs as URLs by prefixing them accordingly, for example with https://doi.org/. Accordingly, the INSPIRE Geoportal maintained by the European Commission, which also uses DOIs for referencing its published documents, lists resolvable URIs as a best practice14.
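The 10.XXXX/ID schema and the prefixing with https://doi.org/ can be sketched in a few lines. The regular expression below is a simplifying assumption rather than the official DOI syntax (real suffixes are very permissive), the function names are invented, and the example DOI is hypothetical and not registered anywhere.

```python
import re

# Assumption: a simplified pattern for the "10.XXXX/ID" schema described above.
# It is not a complete validator for every DOI in circulation.
DOI_PATTERN = re.compile(r"^10\.\d{4,}(?:\.\d+)*/\S+$")

RESOLVER_PREFIX = "https://doi.org/"


def split_doi(doi: str):
    """Split a DOI into (publisher prefix, suffix) or raise ValueError."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI of the form 10.XXXX/ID: {doi!r}")
    prefix, suffix = doi.split("/", 1)
    return prefix, suffix


def resolver_url(doi: str) -> str:
    """Turn a DOI into a URL by prefixing it with the resolver address."""
    split_doi(doi)  # validates the syntax first
    return RESOLVER_PREFIX + doi


example = "10.1234/example-dataset.v1"  # hypothetical DOI, for illustration only
print(split_doi(example))
print(resolver_url(example))
```

Suffixes containing characters that are reserved in URLs would additionally need percent-encoding, which this sketch leaves out.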

In closed systems, the uniqueness of DOIs obviously only needs to be ensured within the limits of the system. However, if the data is expected to be released to the public at a later point in time, it may be wise to adopt a corresponding strategy from the beginning.

Different tools exist that make working with DOIs easier. For example, browser extensions allow resolving DOIs extracted from web pages without manual intervention as well as storing them for later reference15 16.

 

5.3 Distributed Ledgers and Blockchain

Blockchain and Distributed Ledgers have promising properties to securely track provenance information in multi-stakeholder environments. They are best suited for cross-organisational and global use-cases, where data integrity may be a substantive concern and the need for a single source of truth may be crucial. The terms Blockchain and Distributed Ledger are often used interchangeably. However, Distributed Ledger can be understood as a more general term and Blockchain as a specialized form of it.

A Blockchain is an ordered list of blocks where each block holds a finite list of transactions. Transactions can hold arbitrary data and are digitally signed by their initiators. An approved cryptographic hash function is used to create a digital fingerprint of the block content. Each block contains the fingerprint of its predecessor, and thereby a chain of blocks is created. Modifying a transaction in an earlier block would invalidate the signature of the transaction, the fingerprint of its block and the fingerprints of all subsequent blocks. The Blockchain is replicated among several machines and its true state is determined by a consensus algorithm. Therefore, the Blockchain is an append-only register for transactions that provides immutability under normal circumstances.17
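The chaining of fingerprints can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function and deliberately leaves out signatures, replication and consensus; the function names and the example transactions are invented for illustration.

```python
import hashlib
import json


def block_fingerprint(block: dict) -> str:
    """SHA-256 fingerprint over the block's transactions and predecessor hash."""
    content = json.dumps(
        {"transactions": block["transactions"], "previous": block["previous"]},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.sha256(content).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    """Append a block that stores the fingerprint of its predecessor."""
    previous = chain[-1]["fingerprint"] if chain else "0" * 64
    block = {"transactions": transactions, "previous": previous}
    block["fingerprint"] = block_fingerprint(block)
    chain.append(block)


def first_invalid_block(chain: list):
    """Return the index of the first block that fails verification, else None."""
    previous = "0" * 64
    for index, block in enumerate(chain):
        if block["previous"] != previous:
            return index
        if block["fingerprint"] != block_fingerprint(block):
            return index
        previous = block["fingerprint"]
    return None


chain = []
append_block(chain, ["A registers dataset-1"])
append_block(chain, ["B copies dataset-1"])
append_block(chain, ["C transforms dataset-1"])
untouched = first_invalid_block(chain)  # None: the untouched chain verifies

chain[1]["transactions"][0] = "B never touched dataset-1"
tampered = first_invalid_block(chain)  # 1: the stored fingerprint is now wrong
print(untouched, tampered)
```

Recomputing the fingerprint of the tampered block does not help an attacker, because the next block still stores the old fingerprint and verification then fails one block later.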

The use of Blockchain makes sense when entities need to share data but trust each other only marginally, if at all. Ideally, the technology is used as a decentralized, fault-tolerant register that is tamper-proof and provides data integrity and transparency.

In general, a Blockchain system consists of a network of machines, a Blockchain that is replicated across those machines and a network protocol that defines the rules according to which the system operates. Rules are defined in terms of rights, responsibilities, means of communication, verification and validation of transactions, consensus mechanisms, authorization and authentication, mechanisms for appending a new transaction, incentive mechanisms and the type of data that a transaction can contain.18

Blockchain systems fall into four types:

  • In a public permissionless Blockchain system anyone is free to join and has the right to read and write to the Blockchain.
  • In a public permissioned Blockchain system anyone is free to read the Blockchain but only an authorized group can write to it.
  • In a private permissionless Blockchain system both read and write operations are restricted to an authorized group.
  • And in a private permissioned Blockchain system an authorized group can read the Blockchain but only a subgroup is allowed to write to it.19

Public permissionless Blockchain systems, like Bitcoin20 or Ethereum21, usually provide a finite cryptocurrency that can be traded by transactions. In such systems each transaction costs a variable fee that is used to pay the block creators. The fee amount is usually determined by supply and demand. This incentive mechanism, driven by scarcity, is used to ensure that the Blockchain system is making progress. Other types of Blockchain systems typically have authorized entities that are known and inherently interested in maintaining the functionality of the system. Thus, no incentive mechanism for block creation may be needed. Blockchain systems like Hyperledger Fabric22 or Corda usually do not implement a native cryptocurrency. Typical examples of Distributed Ledgers are IOTA23 and Corda24.

A Blockchain can be used to store any type of digital asset. Usually, a digital asset contains a unique identifier and associated data. An initial transaction creates the asset and stores it in the Blockchain. Additionally, the signature of the transaction links the owner to the asset. Modifications of the ownership or of the associated data itself are recorded in subsequent transactions. Whether others are allowed to make such changes depends on the rules defined for that asset. However, the Blockchain was not designed to store large amounts of data or to query data. It is better suited to storing a state and how that state progresses. There are therefore various reasons to reduce the amount of data that is stored in the Blockchain. For instance, in a public permissionless Blockchain system, the amount of data that a transaction can hold is limited and a fee is charged for each transaction. It might thus be reasonable to store only a hash of the actual data. The data itself is then stored in a separate database, and modification can be ruled out by comparing the hash that is stored in the Blockchain with the hash of the data that is stored in the separate database.

A Blockchain can be used to secure provenance information of data. For this purpose, a data provider creates a transaction that logs the identifier, the hash of the associated data, and the database location in the Blockchain. The transaction is signed by the data provider and, thus, provides information about the source of this data. Any movement of this data can be logged by others in subsequent transactions. These would refer to the same identifier and provide a different database location. Additionally, changes in the hash value would indicate that the data was modified.
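The logging scheme described above (identifier, hash and database location in the transaction, data in a separate database) can be sketched as follows. Everything here is a stand-in made up for illustration: a Python list plays the role of the Blockchain, a dictionary plays the role of the separate databases, the `signer` field replaces a real digital signature, and the `db://` locations are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 hash of the given bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ProvenanceTransaction:
    identifier: str   # persistent identifier of the dataset
    data_hash: str    # hash of the data held off-chain
    location: str     # where the data is stored
    signer: str       # stands in for a real digital signature


ledger = []      # stand-in for the Blockchain (append-only by convention)
databases = {}   # stand-in for the separate databases, keyed by location


def record(identifier, data, location, signer):
    """Store the data off-chain and log identifier, hash and location."""
    databases[location] = data
    ledger.append(
        ProvenanceTransaction(identifier, sha256_hex(data), location, signer)
    )


def history(identifier):
    """All logged transactions for one identifier, oldest first."""
    return [tx for tx in ledger if tx.identifier == identifier]


def is_unmodified(tx):
    """Compare the logged hash with the hash of the data at the location."""
    return sha256_hex(databases[tx.location]) == tx.data_hash


record("dataset-1", b"temperature,21.5\n", "db://provider/d1", "provider")
record("dataset-1", b"temperature,21.5\n", "db://consumer/d1", "consumer")

moves = history("dataset-1")
locations = [tx.location for tx in moves]
same_hash = len({tx.data_hash for tx in moves}) == 1  # moved, not modified

databases["db://consumer/d1"] = b"temperature,99.9\n"  # silent off-chain edit
checks = [is_unmodified(tx) for tx in moves]
print(locations, same_hash, checks)
```

The first transaction identifies the source, the second one shows the movement to a new location under the same identifier, and the hash comparison exposes the later edit of the consumer's copy.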

In the literature, several use-cases are presented that use Blockchain to secure provenance information of data. The combination of Radio-frequency Identification (RFID) chips and Blockchain is proposed to record the state of a product’s lifecycle across the different stakeholders of a supply chain25. Moreover, securing sensor data of Internet-of-Things (IoT) devices via Blockchain can support the forensic analysis of digital data26. In the context of autonomous driving, manufacturer, maintainer, and the car itself can store sensor data in the Blockchain. In case of an accident, the car owner would issue a liability claim and the insurance company can query the Blockchain for provenance information and forward the liability claim to other stakeholders depending on the evidence.27 Blockchain can also help to secure cloud computing environments by auditing all operations regarding data in a tamper-proof manner28. In practice, there are only a few real-world examples that use Blockchain to secure provenance information about data. In their product called TradeLens, IBM and Maersk use a Blockchain system to, among other applications, track container ships in a global supply chain29. Blockchain has also found use cases in the world of open data. For instance, the Ethereum Blockchain is used by Open Government Data Vienna to store the hash and identifier of data records so that data consumers can check whether data has been altered30.

With the recent hype tailing off and as a technology that must still prove its applicability in real-world settings, Blockchain will remain on the technology agenda for many years to come. It is now a trending research topic, especially in the field of provenance tracking. However, real-world examples that go beyond the financial sector are urgently needed to show the full potential of Blockchain.

 

5.4 International Data Spaces

Tracking of data provenance in a multi-organizational domain is firmly connected to the objective of enabling trustworthy data exchange and ensuring data sovereignty. The ability to track the movement of data between companies and to detect potentially unlawful transfers of data is essential to the enforcement of rights over data. This section describes an initiative to build such multi-organizational environments for data sharing.

The International Data Spaces (IDS)31 32 initiative is a joint effort between industry and research to design, implement and establish an architecture for trustworthy data exchange for the data economy. The International Data Spaces initiative was founded in 2014 as a cooperation between the Fraunhofer-Gesellschaft as research partner and the International Data Spaces Association as not-for-profit user association. The main goal of the IDS initiative is to create an environment enabling data sharing between companies and to provide the framework for trusted and secure data exchange in business ecosystems. In order to provide such an environment, the IDS aims to meet various requirements including “trust”, “data sovereignty” and “standardized interoperability”.

Trust as the driving idea behind the International Data Spaces is ensured by evaluation and certification of all participants before admitting them to the trusted network. Data sovereignty describes the self-determined exchange of data. Data sovereignty is empowered by allowing data publishers to attach usage restrictions to their data offers and ensuring that a data consumer fully accepts and enforces the given restrictions. Standardized interoperability ensures that each component participating in International Data Spaces can communicate with the other components by implementing standardized protocols regardless of the developer or vendor of said component.

In that sense the IDS is a peer-to-peer network between certified participants with the purpose of sharing data under predetermined terms and conditions. The network consists of IDS connectors acting as gateways to the network. Data is published and consumed through connectors and usage policies are enforced when data either leaves or enters such connectors. Furthermore, International Data Spaces comprise a variety of other components, such as identity providers, metadata brokers and a clearing house, which perform different tasks to deliver the core promises of the IDS.

Figure 2 Component infrastructure and data flow

The metadata brokers’ task is to ensure discoverability of connectors and data offers. IDS connectors are therefore encouraged to register with a metadata broker in order to make their data offers available to a broader target group, or to query a metadata broker to retrieve information about the data offers available in the IDS. The identity providers provide the technical implementation of “trust” by confirming the connector’s compliance with certification. Before initiating or accepting a data transfer, connectors check the trustworthiness of the opposite connector with the identity provider. The clearing house33 can be involved by the connectors to perform a number of tasks, such as auditing and logging of the performed transactions, billing and invoicing of data transactions, discharging of transactions, and tracing data provenance.

As stated above, the clearing house in International Data Spaces takes the role of a mediator between data consumers and data providers. It serves as a trusted partner during data exchange between partners who do not trust each other outside the IDS environment. The IDS clearing house mitigates the risk of data exchange partners not enforcing usage control policies and, thus, failing their contractual duties.

Connectors can use the clearing house’s functionality before or during data exchange to clear legal, financial or technical questions. These functionalities may be the control and release of data during exchange to make sure that all financial and legal conditions are met or ensuring that the data consumer cannot deny having received the data. Another important role of the IDS clearing house is to log transactions it is involved in to ensure auditability and traceability.

IDS provides various possible ways to track the provenance and lineage of data and can therefore be regarded as an exemplary approach to implementing data traceability in practice. The ability to trace data across the network of companies that makes up International Data Spaces is strongly linked to the IDS goal of ensuring data sovereignty in the sense of monitoring to whom and under what conditions data has been transferred. To ensure such traceability functions, provenance tracking can be implemented within the IDS connector by inserting local tracking components. Alternatively, a central provenance storage component can be added to the IDS clearing house, which already provides logging functionalities. The IDS clearing house receives logging information about all activities performed during data exchange as well as confirmation of successful dispatch and reception of data.
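How a central log of dispatch and reception confirmations supports traceability can be sketched as follows. This is purely illustrative: the class `ClearingHouseLog`, its methods and the connector names are invented for this example and do not correspond to any interface defined by the IDS reference architecture.

```python
from dataclasses import dataclass, field

EVENTS = ("dispatch", "receipt")


@dataclass
class ClearingHouseLog:
    """Hypothetical central log; not an interface defined by the IDS."""
    entries: list = field(default_factory=list)

    def log(self, transfer_id, connector, event, data_hash):
        """Record that a connector reported a dispatch or a receipt."""
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event!r}")
        self.entries.append({
            "transfer": transfer_id,
            "connector": connector,
            "event": event,
            "hash": data_hash,
        })

    def is_confirmed(self, transfer_id):
        """True if dispatch and receipt were both logged with the same hash."""
        hashes = {event: set() for event in EVENTS}
        for entry in self.entries:
            if entry["transfer"] == transfer_id:
                hashes[entry["event"]].add(entry["hash"])
        return bool(hashes["dispatch"] & hashes["receipt"])

    def recipients(self, data_hash):
        """Connectors that confirmed receipt of data with this hash."""
        return sorted({
            entry["connector"] for entry in self.entries
            if entry["event"] == "receipt" and entry["hash"] == data_hash
        })


clearing_house = ClearingHouseLog()
clearing_house.log("t-1", "connector-a", "dispatch", "abc123")
clearing_house.log("t-1", "connector-b", "receipt", "abc123")
clearing_house.log("t-2", "connector-a", "dispatch", "abc123")

print(clearing_house.is_confirmed("t-1"))   # both sides reported the transfer
print(clearing_house.is_confirmed("t-2"))   # no receipt has been logged yet
print(clearing_house.recipients("abc123"))  # who has received this data
```

A matching pair of entries means the consumer cannot plausibly deny having received the data, and the list of recipients answers the question to whom a given dataset has been transferred.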

International Data Spaces contains the necessary components to implement data traceability. So far, however, it is left to the participants to utilize this potential. The IDS does not force participants to use those concepts and standards, and each participant is responsible for providing correct provenance data. Furthermore, the IDS reference architecture model does not dictate the technical means by which components should be implemented. The technologies described in section 5.3 could therefore also be applied in the context of International Data Spaces. Central storage capabilities in the IDS clearing house, usage control enforcement or a local provenance tracking unit in an IDS connector could be implemented using blockchain technology34.

 

5.5 The Web

Data provenance on the web is of major relevance due to the massive increase of data generated these days. Especially because the web can only be viewed globally, it is important to establish and use a common standard. This section highlights how data provenance can be achieved and implemented in a global context.

Data provenance plays an important role not only for the web and its digital documents; the topic has also arrived in the analogue world. In particular, the concept of Industry 4.0, which aims to establish production chains and manufacturing processes that increasingly run without human involvement, is closely related to data provenance. Many production and logistics processes are to be performed automatically by machines, requiring confidential data along the way.

Data provenance records are no more than metadata that describe the actual data (e.g. documents or web pages). The focus is on metadata that supports confidence in the data and its validity. The fundamental idea is formulated as five questions whose answers establish this level of trust:

  • Why was the data produced?
  • How was the data produced?
  • Where was the data produced?
  • When was the data produced?
  • By whom was the data produced?

The World Wide Web Consortium (W3C) has taken up this challenge and developed W3C PROV35. In essence, the recommendation provides a specification to address the above questions36. It is a family of documents that define a model and its serializations, as well as a set of definitions for describing data provenance37. In addition, the existing web technologies XML and RDF are used to achieve the best degree of interchangeability.

The core of this recommendation consists of a data model (PROV-DM), defining a common data provenance vocabulary38. Here, the W3C has also identified different practices of data provenance, because each organization or person implements data provenance differently.

 

Figure 3 High-level structure of W3C PROV

 

Figure 3 shows a high-level overview of the W3C PROV structure. There are three central units: entities, activities and agents. Entities are usually the starting point; they stand for things such as a document, a diagram or a piece of software. They can also refer to other entities they are related to. When content or characteristics are copied from one entity to another, the second entity is a derivative of the first.

Activities stand for the dynamic processes and actions that contribute to the fact that entities change over time. For example, new versions can be created, or translations of text can be written. Thus, new entities are always created by the activities.

Agents take on roles within activities. They can be natural persons, but also a computer program, an entire organization or any other object. They are always bound to activities, for which they bear a corresponding responsibility. The same applies to the relationship between entities and agents: an agent can bear responsibility for an entity. A role in this context is a description of the person or function39.
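The interplay of the three units can be sketched with plain Python data structures. The relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`, `wasAssociatedWith`, `wasAttributedTo`) are taken from the W3C PROV vocabulary; everything else, including the `ex:` identifiers and the two helper functions, is invented for illustration and is not part of the recommendation.

```python
from collections import namedtuple

# Relation names follow W3C PROV; the "ex:" identifiers are made up.
Statement = namedtuple("Statement", "subject relation object")

entities = {"ex:survey-data", "ex:report-v1", "ex:report-v2"}
activities = {"ex:analysis", "ex:revision"}
agents = {"ex:alice", "ex:statistics-office"}

statements = [
    Statement("ex:analysis", "used", "ex:survey-data"),
    Statement("ex:report-v1", "wasGeneratedBy", "ex:analysis"),
    Statement("ex:report-v1", "wasDerivedFrom", "ex:survey-data"),
    Statement("ex:analysis", "wasAssociatedWith", "ex:alice"),
    Statement("ex:revision", "used", "ex:report-v1"),
    Statement("ex:report-v2", "wasGeneratedBy", "ex:revision"),
    Statement("ex:report-v2", "wasDerivedFrom", "ex:report-v1"),
    Statement("ex:report-v2", "wasAttributedTo", "ex:statistics-office"),
]


def objects_of(subject, relation):
    """All objects linked to `subject` by `relation`, in statement order."""
    return [s.object for s in statements
            if s.subject == subject and s.relation == relation]


def derived_from(entity):
    """Follow wasDerivedFrom links back to the origins of an entity."""
    chain = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for source in objects_of(current, "wasDerivedFrom"):
            if source not in chain:
                chain.append(source)
                frontier.append(source)
    return chain


def responsible_agents(entity):
    """Agents attributed to the entity or associated with its generation."""
    found = set(objects_of(entity, "wasAttributedTo"))
    for activity in objects_of(entity, "wasGeneratedBy"):
        found.update(objects_of(activity, "wasAssociatedWith"))
    return sorted(found)


print(derived_from("ex:report-v2"))
print(responsible_agents("ex:report-v1"))
print(responsible_agents("ex:report-v2"))
```

In this small graph the "by whom" question is answered through the activity for the first report version and through direct attribution for the second, which mirrors the two ways in which agents are bound to activities and to entities.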

One of the most important factors is time, because data changes over time. Changes are addressed in a very detailed way in the W3C PROV recommendation and are recorded separately for each unit. Further application examples can be found in the primer accompanying the W3C PROV recommendation.40

A demonstrative example is the "ProvStore" project41. This is a public repository for documents, which is the first to completely implement the W3C PROV recommendation. A user can publish documents together with provenance data. In some cases, it is also possible to release them only for certain registered users.

Based on this, other users can very clearly understand the data and its provenance. In addition to the actual document description in text form, there are also various visualizations that document the temporal progression of changes. Moreover, there are also functions for the transformation of the data provenance and export options in different serialization formats, which were also part of the W3C recommendation.

The ProvStore is based on Python and the web framework Django. Its development also relies on a dedicated Python package called "prov". This package is a full implementation of the W3C recommendation and can be used under the MIT License42.

Image credit: Support Centre for Data Sharing
