Tracing data back to its origins, i.e. identifying its provenance, or examining how data is used through its lifecycle, i.e. understanding its lineage, is of interest in many areas and use cases. This section summarises a few important examples where the tracing of data is either an inherent challenge – or where its targeted application can help resolve other challenges.
Depending on the use case, it may be desirable to track data during processing. Especially as systems become more complex, it may be important for users to be able to identify and authenticate data assets and authorised users unambiguously. In order to achieve this identification, suitable solutions are required to make data identifiable and to authenticate resources.
Generally, identifiers should be unique, resolvable, interoperable, and persistent. 1 One simple way is to assign a random Universally Unique Identifier (UUID) for each dataset, which, depending on the version used, is virtually collision free. An alternative method is to use Digital Object Identifiers (DOI). Both concepts are explained in detail in section 5.2. Another important aspect to consider when processing data in multiple instances is hardening against unauthorized tampering. It is important that the origins of changes can be tracked unambiguously.
Both requirements, i.e. integrity and authentication, can be fulfilled using digital signatures. This concept is based on asymmetric cryptography, implemented for example by the Rivest-Shamir-Adleman (RSA) algorithm. In such a system, each actor generates two keys, a public and a private one. These act as complementary secrets, meaning that data encrypted with one can be decrypted with the other. As the names suggest, the private key must not be shared, whilst the public key is intended for sharing. Depending on which key is used for encryption, different goals are achieved. In the case of digital signatures, data is encrypted with the private key. This means that it can only be decrypted with the corresponding public key; using a different key yields a meaningless sequence of data. This ensures authenticity. Malicious parties intending to alter data can decrypt digitally signed data, but as they lack access to the original private key, they cannot recreate a legitimate digital signature. This ensures integrity. The concept of digital signatures is illustrated in Figure 1.
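As an illustration of the sign-with-private-key, verify-with-public-key principle, the following toy sketch implements textbook RSA with deliberately tiny primes. All numbers are invented demo values; real systems rely on vetted cryptographic libraries and keys of 2048 bits or more.

```python
# Toy RSA signing/verification with tiny primes -- illustrative only, NOT secure.
p, q = 61, 53                 # two small primes (kept private)
n = p * q                     # public modulus: 3233
phi = (p - 1) * (q - 1)       # 3120
e = 17                        # public exponent
d = pow(e, -1, phi)           # private exponent: modular inverse of e

def sign(message_digest: int, priv: int, modulus: int) -> int:
    """'Encrypt' the digest with the private key to create a signature."""
    return pow(message_digest, priv, modulus)

def verify(message_digest: int, signature: int, pub: int, modulus: int) -> bool:
    """Recover the digest with the public key and compare it to the claim."""
    return pow(signature, pub, modulus) == message_digest

digest = 65                               # stand-in for a hash of the data
sig = sign(digest, d, n)
assert verify(digest, sig, e, n)          # authentic and intact
assert not verify(digest + 1, sig, e, n)  # any tampering breaks verification
```

Because the attacker never holds `d`, no new valid signature can be produced for altered data, which is exactly the integrity property described above.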
For the mechanism outlined above to work, public keys must be trusted. For this purpose, a system called Public Key Infrastructure (PKI) has been established. A PKI allows issuing certificates of authenticity to verified public keys. This can be achieved in two ways. One is a centralised approach, in which a select number of companies or NGOs are trusted to issue respective certificates for public keys. These institutions are called Certificate Authorities (CA) and offer services which issue X.509 conformant certificates to applicants. Notable examples of such authorities are IdenTrust and Let’s Encrypt. The other approach goes by the name “Web of Trust”, denoting a decentralised system for ensuring the authenticity of public keys. The core concept here is that transitive chains of trust are established between users. If A trusts B, and B trusts C, then A can also trust C.
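The transitive chain of trust (A trusts B, B trusts C, therefore A can trust C) can be sketched as a reachability check over a directed trust graph. This is a deliberate simplification: real webs of trust typically weight signatures and limit chain lengths.

```python
# Minimal "Web of Trust" sketch: trust is a directed graph, extended
# transitively via reachability. User names are hypothetical.
def trusts(direct_trust: dict, a: str, b: str) -> bool:
    """True if a trusts b directly or through a chain of intermediaries."""
    seen, frontier = set(), [a]
    while frontier:
        user = frontier.pop()
        if user == b:
            return True
        if user in seen:
            continue
        seen.add(user)
        frontier.extend(direct_trust.get(user, []))
    return False

web = {"A": ["B"], "B": ["C"], "C": []}
print(trusts(web, "A", "C"))  # True: the chain A -> B -> C exists
print(trusts(web, "C", "A"))  # False: trust is directed, not symmetric
```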
Instead of using digital signatures certain media types can also be fitted with a digital watermark. However, this can only ensure authenticity, not necessarily integrity of data. Also, blockchain technologies, as explained in section 5.3, can provide authentication and integrity.
Either technology is fit for use with both data and metadata, since the required procedures are fundamentally the same. Typically, data is hashed before signing. This serves the purpose of normalising the input. If a cryptographic, collision-resistant hash algorithm is chosen, for example SHA-3, this does not lessen security.
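Hashing before signing can be illustrated with Python's standard library: whatever the size of the data or metadata, the signature operation only ever sees a fixed-size digest.

```python
import hashlib

# Hash-then-sign: the (arbitrarily large) input is first reduced to a
# fixed-size digest, which is what would actually be signed.
def digest_for_signing(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

small = digest_for_signing(b"metadata record")
large = digest_for_signing(b"x" * 10_000_000)   # ~10 MB of payload
assert len(small) == len(large) == 32           # both normalised to 256 bits
```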
In the digital age, contracts play a role that is just as important as in analogue times. Every trade or transaction is based on a contract between two partners - and often a middleman. This middleman represents a trustworthy partner who handles the transaction for both sides. In this way, the two partners do not need to worry about possible fraud, as the middleman is responsible for safeguarding the transaction.
New trends in the blockchain segment allow the use of smart contracts. These completely replace the previously mentioned middlemen by computer protocols. Contractual terms are replaced by a technical implementation and formulated as a sequence of logical conditions. These terms and conditions are defined in a blockchain and can be checked and verified permanently.
The two parties of the transaction define the rules of the contract themselves and accept it. If a payment is to be exchanged, this is attached to the contract as well. The smart contract autonomously checks whether the previously defined contractual terms have been fulfilled or not. If the conditions are satisfied, the contract is deemed to be fulfilled and payment is made to the recipient partner.
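The check-and-pay logic described above can be sketched as follows. The contractual terms, escrow amount, and field names are hypothetical, and a real smart contract would be deployed as code on a blockchain platform such as Ethereum rather than run locally.

```python
# Hypothetical sketch of a smart contract's core logic: terms are
# expressed as code and checked autonomously; the attached payment is
# released only when all agreed conditions evaluate to true.
contract = {
    "terms": [
        lambda state: state["goods_delivered"],
        lambda state: state["delivery_date"] <= state["deadline"],
    ],
    "payment": 100,            # escrow amount attached to the contract
    "recipient": "seller",
}

def settle(contract, state):
    """Pay out if and only if every contractual term is fulfilled."""
    if all(term(state) for term in contract["terms"]):
        return {contract["recipient"]: contract["payment"]}
    return {}                  # terms unmet: the escrow stays locked

print(settle(contract, {"goods_delivered": True,
                        "delivery_date": 5, "deadline": 7}))
# -> {'seller': 100}
```

The `state` passed to `settle` corresponds to the external information a real contract platform must obtain, which is why the surrounding text stresses that such data must be traceable and trustworthy.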
This approach offers the advantage of immutability. Once the contract is finally defined, it cannot be changed. There is, however, the possibility to add a new contract and link it to the existing contract in the blockchain. This allows the currently valid conditions (in the form of algorithms) to be extended or replaced.
Since the blockchain is used as the basic technology, all contracts are handled in a decentralised manner. Furthermore, they are self-executing, accurate, fast, and cheaper than middlemen. All algorithms and implementations are also open source, so that they can be evaluated by external persons or experts regarding security and vulnerabilities. Currently, Ethereum is one of the most popular platforms for developing smart contracts 3 . Smart contracts can be used in many settings, such as financial transactions, the management of digital rights, data on clinical studies 4 , administrative processes, or elections.
As described, only the rules for the execution of the agreement are contained in a digital contract. The application that manages smart contracts needs additional information to execute the internal logic of the contracts. This requires the integration of external data. For this reason, the data must be traceable and trustworthy, because smart contracts should only be considered fulfilled if the defined event has truly occurred.
Another important aspect is to regulate the reuse of data transferred or shared based on contractual agreements, mostly via licences. In today's IT landscape, individuals, companies, and organisations upload their data to cloud services, making it easier to share such data with specific individuals. Alternatively, people use cloud solutions to enable data exchange between two or more systems.
Cloud providers can usually evaluate this data, and once the data is shared with third parties, at the latest, the owner completely loses control of it. Creative Commons licences simplify data reuse if relatively specific conditions are met. But the presumption of a relatively liberal reuse regime does not nearly cover all use cases of users who seek to share data today. Hence, the all-important question remains: how can the terms of licences be enforced once the data has been passed on? A mechanism is needed to protect the permission rules attached to digital assets. This might be possible through data provenance records.
Using data provenance for this purpose requires that data and its provenance record are never separated, in order to guarantee consistency. The provenance record should also include the conditions for further use; only then can the data be reused in a trustworthy manner. If inconsistencies occur, or provenance information, including the licence information, is missing, a user must assume that the licence information is not original.
A practical approach uses cryptographic hash functions that calculate a value on the basis of the original data. With each further transformation or distribution, this value is included again in the respective recalculation, creating a provenance chain. Likewise, it can be ensured that the licence conditions always travel with the data to meet consistency requirements. Related hash tables can be used to obtain information on reuse. An advantage is that provenance information does not excessively inflate the size of the data.
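Such a provenance chain can be sketched with a cryptographic hash function: each step's value incorporates the previous step's value, so the chain only verifies if the data and licence information are consistent throughout. Field names here are illustrative.

```python
import hashlib, json

# Sketch of a provenance chain: each transformation step hashes its own
# record together with the previous step's hash, so altering any earlier
# step invalidates every later hash in the chain.
def chain_step(prev_hash: str, record: dict) -> str:
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha3_256(payload.encode()).hexdigest()

h0 = chain_step("", {"event": "created", "licence": "CC-BY-4.0"})
h1 = chain_step(h0, {"event": "transformed", "licence": "CC-BY-4.0"})

# Changing the licence in the first record yields a different h0 and thus
# a different h1, exposing the tampering.
tampered = chain_step(chain_step("", {"event": "created", "licence": "CC0"}),
                      {"event": "transformed", "licence": "CC-BY-4.0"})
assert tampered != h1
```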
Data protection laws do not usually specify explicit requirements to maintain data provenance or lineage records. However, the EU’s General Data Protection Regulation (GDPR) specifically dictates duties on behalf of data collectors and processors that may, in practice, be served well through data provenance and lineage records.
Barring some minor exceptions 5 , article 30 of the GDPR obliges data controllers and processors to maintain so-called “records of processing activities” – in short, “processing records” – of personal data. Generally, these records ensure transparency with regard to the processing of personal data based on consent given by individuals; if nothing more, they provide legal protection against any claims of unlawful data processing. Overwhelmingly, however, these records are kept at the level of processing activities, i.e. they are not necessarily specific to – let alone attached to – data sets or individual data points.
However, records kept at the level of data sets or individual data points may specifically help data controllers better comply with two standard requirements of data protection laws. First and foremost, data controllers are obliged to ensure that data is processed only in line with the consent given, including any sharing of data with potential data sub-processors. Information on consent details in data provenance and lineage records can help to ensure such consensual processing of data. The second standard privacy provision, not only found in the GDPR, stipulates that data must be deleted – or at least stripped of any personally identifiable information – if the processing purposes have been achieved or when a contingent expiration date is due.
Additionally, the GDPR also grants data subjects rights for data access (Art.15), data rectification (Art.16), data erasure (Art.17), restriction of processing (Art.18), and data portability (Art. 20). Based on these articles, data subjects may at any point in time ask data controllers to:
- access data held by a data controller;
- correct erroneous data records;
- delete data records (thus, exercise the “right to be forgotten”);
- restrict how data is processed; or
- transfer data to another controller without hindrance.
Speedy, efficient, and thorough compliance with such requests ultimately depends on whether a data controller – and any relevant data (sub-)processors – can easily identify and retrieve or erase the relevant data in their databases and other systems. For example, to satisfy data access and, similarly, data portability requests, data controllers must know where data is stored and must ensure that the relevant data is accessible and can be communicated or extracted for the requesting user. If a data subject seeks to rectify incorrect data or demands the restriction of processing, a data controller must also be able to quickly establish, for example, in which processes said data is used.
Beyond the creation of processing registries that focus on processing activities, no standard practices have emerged to optimise data provenance and lineage recording in light of GDPR. However, recording for example any information relating to data processing compliance (such as consent conditions, allowed duration of processing, etc.) as part of data provenance records may be one solution to further improve data processing activities. For this, refinements of data models used by organisations of all kinds may be required. Adding such information to data records, ideally building on the W3C-PROV specification (see section 5.5) and including data identifiers (see section 5.2) may enable an elevated compliance with data protection laws such as GDPR.
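A data record enriched with compliance information might look like the following sketch. It is loosely inspired by W3C PROV terms; the identifiers, field names, and the simple purpose/retention check are invented for illustration.

```python
# Hypothetical data record carrying provenance and compliance metadata.
# "prov:" keys loosely mirror W3C PROV vocabulary; all values are demo data.
record = {
    "id": "urn:uuid:demo-0001",                   # persistent identifier
    "prov:wasDerivedFrom": "urn:uuid:demo-0000",  # parent data set
    "prov:wasAttributedTo": "org:data-controller",
    "compliance": {
        "consent_scope": ["analytics"],           # purposes consented to
        "consent_given": "2021-03-01",
        "retention_until": "2023-03-01",          # deletion deadline
    },
}

def processing_allowed(record: dict, purpose: str, today: str) -> bool:
    """Check a processing request against the embedded consent metadata.
    ISO date strings compare correctly in lexicographic order."""
    c = record["compliance"]
    return purpose in c["consent_scope"] and today <= c["retention_until"]

assert processing_allowed(record, "analytics", "2022-01-01")
assert not processing_allowed(record, "marketing", "2022-01-01")  # no consent
```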
Data processing and analysis have become an integral task for many organizations, research institutions and corporations. This often includes the continuous cleaning, transformation and aggregation of data. The results of data analyses primarily support decision-making, new service generation, and the attainment of new knowledge.
The impact of modern data processing on organisational decision making is already high - and its relevance is continuously growing. Because such weight is given to data-driven decisions, it is paramount that data-driven services and decisions are reproducible and accountable. This is nowadays specifically true for decisions and scientific findings related to individuals, but will become more relevant in many more domains and applications in the future. Decisions will shift more and more from being based on human intuition to being based on data. This change has wide ranging implications for legal and moral liabilities.
One example is data-driven public administrations and governments, which can be held accountable and, thus, must be able to explain the rationale for individual decisions and general rules. Similarly, private companies that must follow specific laws and regulations may have to prove how they arrived at decisions in line with regulatory requirements. These issues become even more complex when third parties reuse data products and create subsequent artefacts. In each of these cases, data processing and analyses must not be black boxes; instead, their mechanics must be auditable. Accordingly, data processing and analysis systems and setups should keep a detailed record of the flow of data and produce clear lineage statements. Ideally, this implies the tracking of various information, e.g.:
- the origin and provenance of source data;
- a trail of the applied data operations;
- a timestamp and information on the responsible person(s);
- a record of different or intermediate versions of the data.
Furthermore, it is desirable that the lineage record is immutable and technically secured against manipulation, which will increase the accountability even more (see section 5.3. distributed ledger and blockchains).
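The information listed above could be captured in a minimal lineage record like the following sketch. The field names are illustrative, not a standard schema, and the `frozen` flag only provides immutability at the language level; genuine tamper-resistance would need measures such as signatures or a distributed ledger.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Minimal sketch of a lineage record covering the points listed above.
@dataclass(frozen=True)             # frozen: instances cannot be mutated
class LineageRecord:
    source: str                     # origin/provenance of the source data
    operations: tuple               # trail of the applied data operations
    responsible: str                # responsible person(s)
    version: str                    # data version produced by this step
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

rec = LineageRecord(source="sensor-feed-A",
                    operations=("clean", "aggregate"),
                    responsible="jane.doe",
                    version="v2")
print(rec.operations)  # ('clean', 'aggregate')
```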
However, the practical implementation of such records poses big challenges, both technically and organisationally. While it is undoubtedly challenging and resource-consuming to approach the implementation early, e.g. when designing new data pipelines and similar systems, a retroactive application can be even more challenging or close to impossible. A broad variety of data processing and analytics solutions exists: ranging from very basic, scripting-based approaches, including Excel, to more advanced local setups, like RStudio or Jupyter Notebooks, to data-intensive computing platforms running on server clusters, like Hadoop or Spark. These options are constantly growing and new products are emerging. As provenance pipelines will have to work across various tools and environments, this will make reproducibility and accountability even harder in the future.
However, until now, the developers of such products pay scant regard to lineage and traceability. 6 Most tools do not offer out-of-the-box solutions, increasing the need for tailored, more expensive solutions. Nevertheless, some lineage and provenance-enabled products or add-ons exist: Some examples are CamFlow 7 for provenance capture on operating system level, noWorkflow 8 for Python scripts, RDataTracker 9 for R code, RAMP 10 for extending Hadoop with provenance and Apache Airflow 11 with support for lineage capture for workflows.
A decisive criterion for the selection of technology in future data-driven projects should thus be the availability of such tools. Additionally, it is indispensable to initiate a structured discourse in organisations early on about how to achieve reproducibility and accountability of data with future and deployed technologies. Such developments should lead to the implementation of organisational and technical guidelines, the designation of dedicated personnel, and intra-organisational cooperation along the data processing chain. Sustaining this process is also critical in order to ensure an ongoing capacity to comply with emerging legislation and requirements.
In the past few years, the evolution of artificial intelligence has also brought a twist to data-driven decision making: while the behavior of information systems was in the past well defined by human-developed algorithms, modern information systems increasingly base their decisions on data without needing a pre-defined set of rules. The process of deriving the basis for these decisions from data is commonly referred to as machine learning (ML). Various ML methods exist; they represent mathematical models that adjust a set of function parameters over time, depending on the presented data, to create an ML model.
These models and the decisions they produce are hard to explain because the desired behavior is achieved by adjusting parameters without providing any explanation of what these parameters actually represent. From a data provenance and lineage perspective, this is highly problematic, as data is effectively manipulated in ways that are very hard to track, trace, and explain to humans. The resulting models are effectively black boxes, making it difficult for external inspectors to comprehend decision-making processes. This also renders any reliable certification of such systems almost impossible.
There are however numerous approaches to explain data-based decisions. These can be grouped into four categories 12 :
- Model Explanation: An explainable model mimicking and approximating the black box’s behavior is provided. Humans can then interpret this surrogate model to explain the decisions made by the black box.
- Outcome Explanation: The black box does not only provide the outcome of the decision-making process, but also outputs an explanation for the current calculation.
- Model Inspection: The explanation for the decision is derived by analyzing the relation between input parameters and output.
- Transparent Box Design: Explainability is already considered during the design of the model.
Furthermore, human-in-the-loop methods describe the approach of utilising a human to validate or tutor the ML model during the training process, or even to correct the explanations provided by the model, in the case of a model that can supply its own explanation for a decision 13 .
In some cases, there are specific ways to interpret an ML model, e.g. for decision trees, decision rules, linear regression, or logistic regression. In other cases, model-agnostic methods such as Shapley values, Saliency Maps, Partial Dependence Plots, or Individual Conditional Expectation must be applied 14 .
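As a flavour of such model-agnostic methods, the following sketch computes a crude partial dependence estimate by hand, treating the model purely as a black-box function. The model and data are invented toy examples, not any specific library's API.

```python
# Hand-rolled partial dependence sketch: average the black box's output
# over the data while forcing one feature to each grid value, revealing
# that feature's marginal effect. Toy model and data, for illustration.
def model(x1: float, x2: float) -> float:
    return 3.0 * x1 + 0.1 * x2          # stands in for any opaque model

data = [(0.0, 5.0), (1.0, 2.0), (2.0, 8.0)]

def partial_dependence(f, grid, data, feature=0):
    out = []
    for v in grid:
        preds = [f(v, x2) if feature == 0 else f(x1, v) for x1, x2 in data]
        out.append(sum(preds) / len(preds))
    return out

pd = partial_dependence(model, [0.0, 1.0, 2.0], data)
print(pd)  # rises by ~3 per unit of x1, exposing its strong influence
```

Note how the method never inspects the model's parameters, only its input/output behaviour, which is exactly why it applies to arbitrary black boxes but can only ever approximate them.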
While the described approaches to explaining ML models provide means to retrace which model features led to a certain decision, these methods can only provide an approximation and tend to invite subjective interpretation. As a result, hard certification criteria cannot be met, and any explanation might remain flawed or insufficient when presented to an outside audience.
Beyond the specific use case, the type of system in which data tracing must be realised is naturally a decisive factor for the implementation of any provenance or lineage solution. This section explains the implications of data provenance along three dimensions: local systems, intra- and cross-organisational systems, and global systems.
One of the most obvious differentiating characteristics of any technical system – not only of a computer system – is its user community: It may consist of a single (nearly) fixed user over the whole lifetime of the system, of a group of “well-known” users, or of a potentially very big group of – more or less – anonymous users. Examples are (1) a smartphone and its user/owner, (2) the computer network of a company and the employees registered for network access and use, and (3) a publicly accessible information exchange forum that requires no registration.
Another dimension is administrative control: Single user systems may be administratively controlled by the user him-/herself, but there are also configurations where control is delegated to a third party, e.g. when a system is enabled to access confidential company data. Sometimes, the members of a “well-known”, trustworthy user group may be given administrative control over the commonly used system, but typically such systems are controlled by dedicated personnel. By default, anonymous users cannot be trusted, thus systems with anonymous or just very big user groups must be third-party controlled in general. In either case, third-party control can further be classified as being set by a single organisation or as being based on an agreement or a contract between organisations.
At first glance, the geographic reach of a technical system could also be a differentiator. But upon closer observation, this is nearly irrelevant today – particularly when compared to the user and the control perspective: in times of virtualisation and “XaaS”, components that are distributed worldwide can form a single-user system, and globally distributed components in a virtual private network can be strictly controlled by a single organisation. Conversely, a single, centralised system can serve a huge number of globally distributed users via the internet.
For any kind of system, both organisational and technical data lineage measures are required and must be regarded in common. The following sections explain the implications of different archetypes of system, including:
- centralised systems under “local” control of one user;
- intra- and inter-organisational systems that work either within large, distributed organisations or between different organisations; and
- global systems that serve anonymous users.
4.1 Local systems
Processing, i.e. creating, storing and/or transforming of data on a (local) stand-alone system – that may nevertheless consist of several interconnected physical and logical components – nowadays typically takes place under the control of a single person that changes only under special circumstances. 15 This person has at least two roles in parallel, as the process owner of all productive data handling processes and as the administrator (“super-user”), configuring, managing and monitoring the stand-alone system. In its strictest form, storing and processing data on local systems with highly controlled access is not overly common today. However, one typical example is the handling of sensitive data in medical research.
If the person is sufficiently skilled in data processing and system administration, she will exactly understand the specific data lineage requirements at hand. She will also be able to configure productive processes and the (universal) underlying system in a consistent and secure manner according to the lineage and provenance requirements.
Even though that approach may not be the most efficient under all conditions, it bears a relatively low risk of missing the recording of relevant lineage data or of accidentally creating a bypass around lineage data recording. Assuming that the person in question is interested in correct data handling, there is also a relatively low risk that the person intentionally corrupts the data and its processing or the lineage data and its processing. The same holds for the intentional bypassing of lineage data recording.
As stand-alone systems are closed by definition and should be configured accordingly, there are in general fewer possibilities to break into such systems. There are, thus, also fewer risks that data, lineage data, or any corresponding processing are intentionally (or accidentally) corrupted by third parties. Typically, there is also a lower risk that data gets illegitimately copied (with or without lineage data) and exported to external systems or data storage devices.
Securely configured stand-alone systems should, and generally can, have only a small number of well-defined interfaces that are as simple and specific as possible. This makes it easier to monitor the various data transfers (productive data, transformation programs, transformation parameter data, etc.) between the stand-alone system and its environment, and thus to detect any attacks on the data concerned or its processing means.
Individual (stand-alone) systems typically have an effective data set (i.e. file) access control management capability built into the operating system. This makes it possible to protect input data sets and their lineage data against accidental or intentional modification or destruction.
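On a POSIX system, such protection of input data sets can be as simple as clearing the write bits via the operating system's permission mechanism. The file name below is illustrative.

```python
import os, stat, tempfile

# Create a throwaway input file, then use the operating system's file
# permission bits to protect it against accidental modification.
path = os.path.join(tempfile.mkdtemp(), "input.csv")  # illustrative name
with open(path, "w") as f:
    f.write("id,value\n1,42\n")

os.chmod(path, stat.S_IRUSR | stat.S_IRGRP)   # read-only for owner/group
mode = stat.S_IMODE(os.stat(path).st_mode)
print(oct(mode))  # 0o440 -- no write bits set; regular users cannot modify it
```

The same mechanism can be applied to the accompanying lineage metadata files, so that both travel through processing in a write-protected state.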
For some kinds of processing, the challenge of data lineage – recording huge amounts of metadata concerning the origin and the processing of the productive data – can partially be relaxed (at the cost of inverse processing) when both the original/input data and the processing results stay “together”. In principle, this automatically applies to stand-alone systems. But limited data storage capabilities may hinder this alternative.
System management, application management, and data management are complex and complicated tasks which are all worth handling by an expert. Even if the user of a stand-alone system is well educated, she might miss current developments in some areas. As a result, she may select suboptimal measures (in terms of efficiency and/or effectiveness) given the requirements of the specific task at hand, but even more so for the requirements of future users along the data processing chain or for the requirements of initial data providers requiring a specific handling of data. Depending on the utility of the processable data, the risk of attempts to physically access a local system should not be underestimated, and corresponding security measures should also be established.
Limited data storage capabilities and/or limited data processing capabilities may restrict data lineage metadata recording. Limited data processing capabilities may be specifically restrictive if any results must be produced in short order. Limited data storage capabilities may require that the original data is deleted during processing. This can occur even if the original data is not overly big, but must be run through many iterations of a model, each occupying storage space. In this case, data lineage metadata inherited from the original data must be explicitly coupled to the processing results (not just by links).
The major characteristic of both intra- and cross-organisational systems is that they can be subjected to strict governance and/or policies binding both control personnel (i.e. administrators) and ordinary users. For an intra-organisational system, this policy is set by the organisation. 16 Cross-organisational systems are either run according to the policy of one of the involved organisations, e.g. the hosting organisation, or according to a dedicated policy explicitly established for the interoperation of the organisations (e.g. as part of a project contract). In either case, cross-organisational systems require some explicit policy, particularly for liability reasons. Therefore, a single-user system administered and run according to a (cross-)organisational policy is categorised as an intra- or cross-organisational system here.
From a component point of view, a cross-organisational system may be spread across the participating organisations or it may be hosted by just one of the organisations. Even if this does not imply any differences in the functioning or access of users, data lineage recording requirements may differ depending on e.g. the security requirements of data processed. In various cases, any transfer of data across organisational borders must be recorded.
Users of intra- or cross-organisational systems – according to our definition – are bound in some organisational way (e.g. as a member or employee) to at least one of the organisations concerned.
Because users are organisationally bound, organisational rules – including those concerning data lineage – can be enforced, e.g. by a general contract with users or task-specific additional agreements.
Organisations establishing data lineage policies (or having to fulfil external data lineage requirements) typically nominate data managers and provide them with at least basic data management skills. If the knowledge of such personnel is maintained at an appropriate, up-to-date level, staff can act as both gatekeepers and multipliers towards users. Data managers typically have a better overview of data lineage necessities than data users. Therefore, they can establish or initiate more effective data lineage measures and can raise awareness among users for organisational data lineage measures. These can, for example, mandate that user-specific data processing sessions should not be used by other users for convenience purposes - not even briefly.
Often, systems spread across intra- or cross-organisational resources can be better tuned to timeliness or memory requirements. Therefore, both the keeping of original/input data and the keeping and recording of lineage metadata may be easier than on local systems.
System administrators are typically skilled in general administration tasks, but often lack knowledge about productive application processes running in a system environment, e.g. to handle and process data in a manner specific to certain business processes. Close coordination between the application, data management, and system administration side is, thus, at least required when new applications are brought into operation.
In general, intra- and especially inter-organisational systems are neither stand-alone nor single-user systems. Therefore, they require more elaborated measures to detect and avoid system and data manipulation and unwanted side inputs (accidental or malicious). To effectively shield against these, they also need appropriate user recording. Weak user recording measures can “invite” users to manipulate the system and/or the data as they expect not to be identified as manipulators.
Even though virtualisation techniques are increasingly applied to isolate critical tasks, especially medium and large organisations still frequently use systems for multiple business processes at once. For example, systems could run several productive applications in parallel, or both development and productive tasks at once. Such systems require specific monitoring and handling of (possible) inter-application effects, e.g. resulting from programming errors or resource competition.
According to our definition, global systems are used either by anonymous users, by a group of users which is controlled with only simple means 17 , or by a group of users that cannot or is not intended to be controlled effectively. Even if the majority of users of such a system is bound to the organisation operating a given system, it remains a global system as the least binding (partial) setting defines the system type.
Most global systems are hosted and administered by a single organisation, but several examples of operational global, multi-organisation systems exist as well. One of the most famous global, multi-organisational systems is presumably the Tor Network, an anonymising and location camouflaging network. Curiously, its task is to prevent data lineage. Equally, the Web can be considered a global system.
Global systems may have worldwide relevance, like Wikipedia or the systems of the big news agencies where the public can report news. But an app enabling the citizens – or just visitors – of a small city to report local potholes is just as much part of a global system, typically operated by the local administration.
In terms of data lineage, a global system has no inherent advantages. But global systems can reduce the barriers for relevant data being recorded at all – be it messages about economic crime, natural catastrophes, or environmental pollution.
If no prior measures are taken, global systems by themselves have no means to enforce careful and compliant behaviour of their users. In many cases, there are also no practical legal means to discipline intentional misbehaviour.
In global systems, data lineage must be established “on top”, i.e. via witnesses, trust building mechanisms, quality supervision, qualified majority decisions and so forth. Generally, the system itself becomes the “origin” of such data, e.g. Wikipedia is referenced instead of the original writer of the corresponding article.
If a global system has no or weak quality supervision mechanisms, malicious users can easily and permanently establish fake data. If such lack of quality supervision is ignored and poor quality data is further processed by other systems, questionable data may become seemingly trustworthy. This can happen as such data is eventually spread by trustworthy systems, too. Additionally, human data consumers and even electronic systems tend to examine at most some, often only the latest steps in the data lineage chain for trustworthiness.
- 2 Image by FlippyFlink, adapted from https://en.wikipedia.org/wiki/File:Public_key_encryption.svg (changed from encryption to signing), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=78867393
- 5 Exemptions apply to smaller companies with fewer than 250 employees, provided no sensitive data is processed.
- 6 Kleppmann, Martin. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, 2017, p. 532.
- 12 Guidotti, Riccardo, et al. "A survey of methods for explaining black box models." ACM Computing Surveys (CSUR) 51.5 (2018): 93.
- 13 Teso, Stefano, and Kristian Kersting. "Explanatory Interactive Machine Learning." AIES, 2019. http://www.aies-conference.com/accepted-papers/
- 14 Molnar, Christoph. Interpretable Machine Learning. Lulu.com, 2020.
- 15Other configurations are classified as being intra-/cross-organizational or global (see sections 4.2 and 4.3), even if they take place on a stand-alone system.
- 16Even if an organisation has no explicit policy, some unwritten rules typically exist.
- 17E.g. via self-registration of email addresses with no advanced verification other than subscription confirmation emails.