Data sharing projects typically require accommodating the concerns of a variety of stakeholders.

These often include:

–    that, where personally identifying information is disclosed and used through data sharing, the use is for a purpose of which affected individuals had notice;

–    that the data user is able to verify that, where required (for example, in relation to health information), affected individuals have given informed consent to this disclosure and use;

–    information security, including guarding against internal unauthorised intrusion and external threats such as malicious attacks (denial of service, hacking etc.) and cyber-espionage;

–    maintaining clarity as to who ‘owns’, maintains and is responsible for controlling distribution of data (for example, which core data sets may be transformed by cleansing, normalising, key coding, merging or other transformations or value-adds, and who owns and may subsequently use the transformational code, algorithms, inferences, insights and reports derived from data analytics conducted on these data sets);

–    maintaining trust of citizens that information about them will not be used by government agencies or business enterprises in ways that are privacy invasive, ‘spooky’, contrary to accepted societal norms from time to time, or in ways that may lead to them suffering unfair adverse consequences;

–    complying with restrictions in contracts and in statutes; and

–    protecting confidential information and trade secrets.

#Fail

Many data sharing projects fail to proceed due to an inadequate framework to resolve privacy concerns.

To date, privacy concerns around data sharing have often been addressed by using ‘masking’ of identifiers, that is, the removal of personal identifiers and the pseudonymisation of data sets using transactor keys or tokens.
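By way of illustration, the following minimal Python sketch shows one common form of this masking: direct identifiers are dropped and a keyed token (a ‘transactor key’) is substituted. The field names and key handling are hypothetical and greatly simplified.

```python
import hmac
import hashlib

# Hypothetical example only: field names and key handling are illustrative.
# In practice the secret would be generated and held securely by the party
# performing the masking.
SECRET_KEY = b"replace-with-a-securely-managed-secret"

def pseudonymise(record: dict) -> dict:
    """Drop direct identifiers and substitute a keyed transactor token."""
    token = hmac.new(SECRET_KEY, record["customer_id"].encode(), hashlib.sha256).hexdigest()
    masked = {k: v for k, v in record.items() if k not in {"customer_id", "name", "email"}}
    masked["transactor_key"] = token
    return masked

print(pseudonymise({"customer_id": "C1042", "name": "Jane Citizen",
                    "email": "jane@example.com", "amount": 59.90, "merchant": "GROCER-01"}))
```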

However, some privacy advocates assert that technological advances and multiplicity of data points make re-identification of individuals from pseudonymised data relatively straightforward.

In response, there has been extensive work in recent years in developing privacy protective risk management methodologies in order to specify appropriate and legally enforceable requirements for linkage of data about individuals.

These methodologies may be employed to properly protect data sets such as card transaction records, geo-located movement traces and patient level epidemiological health data that otherwise may be vulnerable to re-identification attack.

De-identification vs anonymisation

Privacy risk management turns on recognising a distinction between de-identification and anonymisation. The stages of removal or obfuscation of direct (name) or indirect (mobile number, movement trace, email address) identifiers of any individual included in a stream of transaction data can be seen as steps along a continuum of de-identification.

Effective anonymisation of transaction level information is the logical end point of that continuum.

Anonymisation means the transaction information still relates to a unique and distinct transactor, but does not enable the individual who is that unique transactor to be identified, whether from the information itself or from any combination of data points reasonably available to any entity that has access to the data stream or its derivations.

Of course, de-identification to the point of anonymisation can often be achieved by aggregation of individual data points, typically for the purpose of making comparisons or identifying patterns, that is, to show general trends or values without leaving granular indirect identifiers that might leave an individual identifiable within the data.

Applying k-anonymity or like methodologies, values determined to be of ‘small numbers’ may be suppressed to minimise risk of re-identification, either through blurring or through omission altogether.
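As a rough illustration of small-number suppression in the spirit of k-anonymity, the following Python sketch aggregates records by quasi-identifiers and suppresses any cell with fewer than k contributors. The fields and threshold are hypothetical.

```python
from collections import Counter

def aggregate_with_suppression(records, k=5):
    """Count records per (postcode, age_band) cell and suppress 'small numbers' cells."""
    counts = Counter((r["postcode"], r["age_band"]) for r in records)
    # None marks a suppressed cell; blurring (e.g. reporting '<5') is an alternative.
    return {cell: (n if n >= k else None) for cell, n in counts.items()}

records = ([{"postcode": "2000", "age_band": "30-39"}] * 7
           + [{"postcode": "2890", "age_band": "80+"}] * 2)  # a 'small numbers' cell
print(aggregate_with_suppression(records, k=5))
# {('2000', '30-39'): 7, ('2890', '80+'): None}
```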

Risk of identification

Sometimes it is possible to de-identify data to the point where the transformed data is safe for public release because there is no more than a remote risk of individuals being identified: in this case, the data has been effectively anonymised. Of course, once the data is released the full artillery of re-identification techniques may be employed on the data by anyone, so anonymisation must be particularly robust, including over time.

Unfortunately, the utility of effectively anonymised data for many purposes, and particularly for epidemiological applications, is severely compromised by aggregation, suppression or blurring.

In such cases, alternative measures must be taken that retain the usefulness of unique individualised data whilst still protecting the privacy of the individuals concerned.

Clearly, useful individual level data cannot be released publicly, but the re-identification risk associated with its use may be managed through pseudonymisation combined with controls on access to and application of that data.

This may be referred to as controlled (or safeguarded) release, only for use in a recognised ‘de-identification zone’. In the scenario of controlled release, the assessed harms from re-identification may be allowed to be higher than for data released into the wild.

Assessed risk is a measure of the extent of the threat posed by a potential circumstance or event, and is therefore typically a function of both the adverse impacts that would arise if the circumstance or event occurs and the likelihood of its occurrence.
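To make that relationship concrete, a coarse qualitative scoring along these lines (illustrative only; real privacy risk assessments use richer, context-specific scales) treats overall risk as a product of impact and likelihood:

```python
# Illustrative only: a coarse qualitative rating of risk as a function of
# assessed impact (harm) and likelihood of occurrence.
IMPACT = {"low": 1, "moderate": 2, "high": 3}
LIKELIHOOD = {"remote": 1, "possible": 2, "likely": 3}

def risk_rating(impact: str, likelihood: str) -> str:
    score = IMPACT[impact] * LIKELIHOOD[likelihood]
    return "low" if score <= 2 else "medium" if score <= 4 else "high"

print(risk_rating("high", "likely"))  # 'high'   -- e.g. data released into the wild
print(risk_rating("high", "remote"))  # 'medium' -- same harms, likelihood mitigated by controls
```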

Therefore, controls deployed in the safeguarded data environment may substantially reduce the likelihood of attempts at re-identification within that environment. Physical, system, human and permitted output controls at the perimeter of the safeguarded de-identification zone may ensure that outputs from that environment are appropriately aggregated or otherwise privacy protected.

Thus, assessed harms from improper release of particular data sets from the safeguarded data environment may be high, but the risk that those harms will be suffered may be so effectively mitigated that assessed re-identification risk is remote.

In summary, data sets and data streams that would usually be considered too high a risk to individual privacy may be managed within a properly planned, documented and implemented privacy management framework that reduces re-identification risk to the point where this risk is remote within the particular context of controlled release and use.

A similar risk assessment methodology may be applied to both controlled access and public release data sets, to determine the point at which re-identification risk is sufficiently remote for the particular context of use.



Come to no harm

Two important consequences follow from the fact that assessment of risk requires assessment of both harms and the likelihood of those harms being suffered.

First, where data is de-identified for limited disclosure or access, provided that disclosure or access has been appropriately, reliably and verifiably limited and controlled, re-identification risk will be significantly less than if that same data was put into the wild.

Second, likelihood of occurrence might be mathematically expressed using an objective scale. But because harms are likely to be quite specific to the circumstance and particular individuals, a fact specific, contextual analysis is required.

In any event, there is no regulatory clarity as to how a ‘low’ or ‘remote’ point of likelihood of occurrence is to be objectively and statistically measured, so assessment of risk has an inherently subjective element.

That is why privacy impact analysis remains an inexact science, or as some information security experts see it, something of a black art.

In any event, a privacy impact assessment should be conducted in relation to any project involving use of purportedly de-identified data which carries any reasonably ascertainable risk of re-identification of any affected individuals, both as to risks within the safeguarded data environment and as to outputs from that zone which themselves might be personally identifying.

We have already noted that perimeter controls around a controlled environment may ensure that released outputs from that environment are appropriately aggregated or otherwise privacy protected. This aspect of safeguards requires particular attention in order to ensure that permitted inferences, insights and reports derived from data analytics conducted on the safeguarded data sets do not leave any individual reasonably identifiable, and otherwise protect the underlying data in accordance with expectations and requirements of the contributor data custodians.

This is an area where contractual restrictions are particularly important, but so too is specification of processes and procedures to ensure that these restrictions are understood, followed and verifiably reliable.

Third parties

Often data linkage projects are outsourced to third parties, to leverage their data science skills and methodologies and to create separation from a data custodian’s personally identifying data sets.

Data analytics service providers may, in controlled environments, facilitate privacy protective data linkage of individual level data. Relevant controls vary, but privacy and security by design compliant arrangements for linkage of individual level data about individuals are typically based upon four key control elements:

–    separation of persons or entities with access to personally identifying information from those persons or entities (‘trusted third parties’) conducting analytics using data sets which have been pseudonymised;

–    replacement of direct or indirect personal identifiers in the merged data sets with a linkage code, or transactor key, which enables the service provider to infer that a transactor appearing in each data set is the same unique transactor, without that transactor being identifiable (see the sketch after this list);

–    a combination of technical, operational, contractual and otherwise legally enforceable safeguards which reliably and verifiably ensure that uses of data outputs are only in accordance with stated purposes, that individuals who are the subject of transaction data are not re-identified, and that records of personal information about those individuals held by any relevant party are not augmented or supplemented in any way through the controlled process; and

–    information governance oversight, data process controls, change control procedures and quality assurance processes that ensure that each of these things is reliably and verifiably implemented and then remains reliable in ongoing operation, and that any change in data flows or deviation from required practices and procedures is promptly identified, considered and (if need be) addressed by appropriate risk mitigation measures.
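The linkage-code element above might look something like the following Python sketch, in which each custodian (or a trusted linkage authority acting for them) derives the same keyed code from a stable identifier, so the analytics provider can match records across data sets without ever seeing the underlying identity. The key handling and field names are hypothetical.

```python
import hmac
import hashlib

LINKAGE_KEY = b"held-by-the-trusted-linkage-authority"  # hypothetical shared linkage secret

def linkage_code(identifier: str) -> str:
    """Derive a non-reversible linkage code (transactor key) from a stable identifier."""
    return hmac.new(LINKAGE_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

# Custodian A (e.g. transaction records) and custodian B (e.g. survey responses)
dataset_a = [{"link": linkage_code("jane@example.com"), "spend": 120.50}]
dataset_b = [{"link": linkage_code("jane@example.com"), "segment": "regional"}]

# The analytics provider joins on the linkage code only; it never receives the email address.
merged = [{**a, **b} for a in dataset_a for b in dataset_b if a["link"] == b["link"]]
print(merged)
```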

Such arrangements are sometimes called ‘trusted third party arrangements’. However, the requirements that engender and enable ‘trust’ should be embodied in specific contractual obligations and associated work processes and procedures to ensure that the arrangements are appropriately privacy protective.

These arrangements are accordingly not a matter of ‘trust’. The requirements are both legally enforceable and exacting to meet and to verify.

Then there’s the law

Of course, such exacting de-identification requirements are not required if affected individuals have given fully informed consent to particular data sharing. Consent is always the best solution, but often this solution is not available because a proposed release and use was outside the contemplation of the data custodians when consents were obtained and the relevant data collected.

So, often a view will need to be taken as to whether the act or practice of creating and using a data linkage code is an act or practice of a data collector regulated by privacy law.

Legal questions include whether pseudonymisation of personal identifiers is itself an act or practice in relation to personal information which requires notice to or consent of the affected individual (and if so, how express that notice or consent needs to be), or whether pseudonymisation is an act or practice akin to, say, anonymisation of personal information by aggregation in reports and analyses.

Further, to what extent can a party disclosing de-identified transaction level information and associated data linkage code rely upon that party’s assessment as to the likelihood of compliance of a downstream recipient with relevant prescribed safeguards?

In other words, how active must the discloser be in verifying that a downstream recipient will meet such commitments as the recipient is willing to give as to its compliance with the requirements which underpinned the discloser’s decision to facilitate the data linkage? The answers to these questions in the Australian regulatory environment remain the subject of some debate and disagreement.

Conclusion

Sharing of data sets between organisations is a young, but fast growing, area of data science practice. Trusted third party arrangements are a key aspect of controlled data sharing that address many of the privacy concerns that arise in relation to data release into the public domain.

But maintaining the trust of consumers, citizens and other stakeholders in (what they understand to be) “data sharing” depends upon businesses and governments building community understanding of how appropriate privacy and security by design settings address legitimate concerns about unconstrained data sharing.

Citizens and their advocates should not be expected to rely upon assurances from governments and businesses as data collectors and sharers that they can be trusted to meet community expectations.

Requirements that engender and enable trust should be sufficiently transparent as to be understood by stakeholders.

They should also be embodied in detailed contractual and legal obligations and associated work processes and procedures in order to ensure that the arrangements are appropriately privacy protective.

The requirements are exacting, both to implement and to verify on an ongoing basis that the requirements are being met. Privacy protective arrangements by design must ensure that processes and procedures anticipate and mitigate reasonably foreseeable risks of failures of processes through human error or oversight and other things that may go wrong.

The risks of failures of controls due to poor specification or monitoring are significant.

The adverse consequences of failures may affect many projects beyond any particular project which suffers the failure: such is the interconnectedness of trust, and its loss, in an interconnected world.

This is not an area for trial and error and iterative process improvements: it is important to get data sharing and data linkage projects right from the start.

That said, we now have a developing international consensus around good practice in privacy risk management and risk responsive design in management of personal information and information security.

Australia is well placed to be a leader in development of socially responsible data sharing and data linkage initiatives.

Peter Leonard is a data, content and technology business consultant and lawyer and principal of Data Synergies, a new data commercialisation consultancy. His practice focuses on data, content and technology businesses and associated corporate transactions. Peter was a founding partner of law firm Gilbert + Tobin, and was awarded ‘Sydney Information Technology Lawyer of the Year for 2016’ by Best Lawyers International.