Using data vs seeing data

When Google announced in late May 2017 that it was able to link bricks and mortar store purchases with exposure to online advertising, it marked an important milestone in the analytics of data and highlighted new challenges in the ethics of data analytics.

The details of Google’s method, which it calls “double blind matching”, are still unclear. However, it enables Google to measure whether its online advertising is driving sales at physical stores.

This has been considered the “holy grail” of online advertisement technology for many years, as it allows direct measurement of the effectiveness of advertising, outside the online retail space.

To do this, Google announced that it had obtained restricted access to around 70 percent of credit card transactions in the US. It is now using that data, along with some clever mathematics and software systems, to statistically link the act of showing someone an advertisement to subsequent sales transactions, but without knowing the explicit linkage for any individual customer, thereby protecting their privacy.

If the technical claims stand up, this is a significant advance, as it is the first publicly announced large-scale deployment of privacy-preserving analytics that merges data between two companies. The fact that it is central to Google’s business model and operating over hundreds of millions of people suggests that these techniques will soon be used in many more businesses and industries around the world.

In an interview with the Washington Post, Jerry Dischler, VP of Product Management for AdWords, Google's online advertising service, said, “Through a mathematical property we can do double-blind matching between their data and our data … Neither gets to see the encrypted data that the other side brings.”

This is well within the realms of technical possibility and is likely to be based on a similar approach to that being pursued by the ‘Confidential Computing’ team within Data61 at CSIRO. We are working on combining homomorphic encryption and privacy-preserving record linkage to help protect people’s information in an increasingly data-driven world.

Homomorphic encryption

Perhaps the most interesting and exciting advancement in securing privacy in data analytics is known as ‘homomorphic encryption’. This technology is both old and new. For more than 40 years there have been public-key encryption systems that allow the user to add up numbers while they are still encrypted, and to decrypt the results.

As a simple example, consider the problem of determining the average salary of a group of people without any of them disclosing their individual salary. Using homomorphic encryption, each person can encrypt their salary, then all the encrypted salaries can be added together, and the result can be decrypted and divided by the number of participants, which will give the answer to the problem. No individual salary will be disclosed in the process as they are encrypted.
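The salary example can be sketched with a toy version of the Paillier cryptosystem, one well-known additively homomorphic public-key scheme. The article does not name a scheme, so the choice of Paillier is an assumption, as are the function names and the hard-coded primes, which are far too small and too structured for real use.

```python
import secrets
from math import gcd

# Toy parameters (assumption): two well-known Mersenne primes.
# Real deployments use large random primes and a vetted library.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """Return (public modulus n, private key) for a toy Paillier scheme."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)  # modular inverse; needs Python 3.8+
    return n, (phi, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under the public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random blinding value in 1..n-1
        if gcd(r, n) == 1:
            break
    # With generator g = n + 1, g**m mod n^2 equals 1 + m*n.
    return ((1 + m * n) * pow(r, n, n2)) % n2


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, priv, c):
    """Recover the plaintext using the private key."""
    phi, mu = priv
    u = pow(c, phi, n * n)
    return ((u - 1) // n * mu) % n


if __name__ == "__main__":
    n, priv = keygen()
    salaries = [52_000, 61_000, 75_000, 98_000]

    # Each person encrypts their own salary with the public modulus.
    ciphertexts = [encrypt(n, s) for s in salaries]

    # An aggregator without the private key sums the ciphertexts.
    total = encrypt(n, 0)
    for c in ciphertexts:
        total = add_encrypted(n, total, c)

    # Only the key holder decrypts, and only the total.
    print(decrypt(n, priv, total) / len(salaries))
```

In this sketch the aggregator handles only ciphertexts and the key holder sees only the encrypted total, which is the separation of roles the example relies on.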

Clearly, to do this safely, one first has to be careful to keep sensitive encrypted information away from the party that can do the decryption. Second, one must keep in mind that the result of the question may be disclosive – if there was a billionaire in the room, the billionaire’s data would dominate the result. Building systems that manage these risks is a significant engineering challenge.
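The billionaire case comes down to simple arithmetic, shown here with made-up figures. The salaries and the helper name `infer_outlier` are illustrative assumptions, not data from any real system.

```python
def mean(values):
    """Plain average of a list of numbers."""
    return sum(values) / len(values)


def infer_outlier(average, n, assumed_typical):
    """Estimate one unusual value from a published average.

    Assumes the other n - 1 people each earn about `assumed_typical`.
    """
    return average * n - (n - 1) * assumed_typical


if __name__ == "__main__":
    # Nine people on a typical salary and one billionaire (illustrative).
    room = [60_000] * 9 + [1_000_000_000]
    avg = mean(room)
    print(avg)
    # Anyone who can guess the typical salary can back out the outlier.
    print(infer_outlier(avg, len(room), 60_000))
```

Encryption protects the inputs during the calculation, but it does not stop the released result from revealing an individual in a case like this.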

More recently, in 2009 Craig Gentry at IBM developed the first “fully homomorphic” encryption system – these systems allow both addition and multiplication of encrypted numbers. This advance in capability is driving more and more innovations in this space and an ever-increasing adoption of these techniques into a range of applications.

Privacy preserving linkage

Another technology area in which there has been a recent advance is in techniques to link databases together without revealing who is in the databases. These methods are based on “hashing” functions – one-way algorithms that, when given two very similar inputs, produce very dissimilar outputs, and do so in such a way that the inputs cannot be deduced from the outputs.
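The “similar inputs, dissimilar outputs” property is easy to see with a standard hash function. The sketch below uses SHA-256 from Python's standard library; the choice of SHA-256, the sample names and the helper names are assumptions made for illustration.

```python
import hashlib


def digest(text):
    """Return the SHA-256 digest of a string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a, hex_b):
    """Count the bit positions at which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


if __name__ == "__main__":
    a = digest("jane smith")
    b = digest("jane smyth")  # one letter changed
    print(a)
    print(b)
    print(differing_bits(a, b), "of 256 bits differ")
```

A plain hash of a name can still be reversed by hashing every plausible name and comparing the results, so practical linkage schemes usually add a secret key that only the linking parties hold.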

These functions can be used to process personal information into keys that allow matching of personal data between different databases. Again, the basic techniques for this have been known for many years. However, in the last 10 years, new methods have been established that allow matching between data even when the data has errors in it – such as spelling mistakes in names – while maintaining the required privacy guarantees.
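One published family of error-tolerant methods splits each name into overlapping letter pairs, hashes every pair into a Bloom filter with a shared secret key, and compares the filters rather than the names. The sketch below follows that idea. It is not necessarily the method the article has in mind, and the key, filter size, hash count and function names are all illustrative assumptions.

```python
import hashlib
import hmac


def bigrams(name):
    """Split a padded, lower-cased name into overlapping letter pairs."""
    padded = "_" + name.lower().strip() + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(name, key, size=1024, num_hashes=4):
    """Return the set of Bloom filter bit positions for a name.

    Each bigram is hashed num_hashes times with a keyed hash (HMAC),
    so a party without the key cannot rebuild the encoding by guessing.
    """
    bits = set()
    for gram in bigrams(name):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode("utf-8")
            mac = hmac.new(key, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(mac[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice similarity of two bit sets: 1.0 means identical."""
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"hypothetical-shared-secret"  # agreed between the two parties
    k = encode("Katherine", key)
    print(dice(k, encode("Catherine", key)))  # similar spelling
    print(dice(k, encode("Robert", key)))     # unrelated name
```

Two spellings of the same name share most of their letter pairs and so score highly, while unrelated names share few. Choosing the score at which two records count as a match is a tuning decision for each dataset.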

These techniques of homomorphic encryption and privacy preserving linkage can be mixed together to enable a wide variety of analytics where the data and the people the data is about are hidden throughout the calculation.

Record linkage enables data about different people to be lined up across different databases, and homomorphic encryption keeps the data itself secret.

Applications include simple things like the calculation of statistics across multiple databases held by multiple companies, as well as far more complex things such as the generation of predictive machine learning models across data from multiple organisations, all while keeping the data secret.

This technology space is moving very fast, and has the potential to alleviate privacy and data security concerns in areas as diverse as health care, where Microsoft Research is very active; making our cities smart without disclosing our personal data; analysing government data for policy improvement while maintaining public confidence; and, as in the Google case above, enabling collaboration between companies on data analytics while keeping their customer data private and their commercial information confidential.

These techniques also highlight new ethical concerns, as they enable applications that were not possible when they required the sharing of personal information.

Privacy and ethics

There is a general acceptance in many countries of the right to privacy, and for data, this implies a right to control who may see our data.

However, do people have the right to control the use of their data, if it does not impact their privacy?

How is this weighed against the social benefits of the use of that data, for instance in medical research or social policy? This is a complex issue that requires a national discussion.

The approach of a population census is generally accepted – individual data is collected, but only aggregate data is released. Most people are comfortable with this as there are benefits generated from the aggregated data, and there is no disclosure of information about an individual through the release or use of this data.

When privacy-preserving computation is used in analytics, it similarly enables the generation of insights across individual data without disclosure of information about the individual. In health care, it has the potential to join data from multiple databases, potentially across jurisdictional and even international borders, while protecting that data from being used in other ways.

At Data61 we see the value of this technology in providing new capabilities for industries such as telecommunications and insurance through our N1 Analytics platform.

In Google’s case above, it enables Google to understand the impact of its advertising, which drives revenue, funding the content on which its advertising is placed.

These new technologies separate the use of data from the ability to see the data.

This has huge potential to protect privacy as we enter a world of ubiquitous data analytics.

It also opens possibilities for new analytics applications and new business models that Australia is well placed to capture as the use of encrypted analytics becomes more widespread.

Google is once again leading the way, and many others will follow.

Stephen Hardy is the group leader of the Data Platform Engineering group at Data61, part of CSIRO. His team works on new technologies for data collaboration without data sharing for government and industry. Stephen was formerly the technology director for data analytics at NICTA. He has a PhD in astrophysics from the University of Sydney, and is an inventor on over 20 US patents and applications, and author of over 20 academic publications.