The NSW government is sharing COVID-19 case data with the help of a world first framework developed by ACS’ Data Sharing Committee.

Data published on the Data NSW website shows the notification dates, likely source of infection, and certain location information of the state’s confirmed COVID-19 cases.

Before releasing the latest information each day, Data NSW treats the dataset by suppressing and aggregating certain fields in order to mitigate the risk that an individual could be identified by combining the release with other available information.
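Suppression and aggregation are generic privacy techniques; Data NSW's actual pipeline is not published in this detail, so the field names and the minimum group size `k` in the sketch below are illustrative assumptions only:

```python
# Illustrative sketch of 'treating' a case dataset before release.
# Two generic techniques from the article are shown: suppression
# (hiding values that appear in too-small groups) and aggregation
# (coarsening exact dates into ISO weeks). The threshold k and the
# field names are hypothetical, not Data NSW's real parameters.
from collections import Counter
from datetime import date

def treat(records, k=5):
    """records: list of dicts with 'date' (YYYY-MM-DD) and 'postcode'."""
    counts = Counter(r["postcode"] for r in records)
    treated = []
    for r in records:
        out = dict(r)
        # Suppression: hide any postcode shared by fewer than k cases.
        if counts[r["postcode"]] < k:
            out["postcode"] = "suppressed"
        # Aggregation: publish only the notification week, not the day.
        iso = date.fromisoformat(r["date"]).isocalendar()
        out["date"] = f"{iso[0]}-W{iso[1]:02d}"
        treated.append(out)
    return treated
```

A postcode with only one case would survive suppression once enough further cases accumulate there, which matches the article's point that which fields are safe to release changes from day to day.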

On the surface, it seems like a simple problem: leave out obviously identifying personal features – like names, phone numbers, and addresses – and the data should be safe.

But Dr Ian Oppermann – ACS President and NSW Chief Data Scientist – who led the charge in creating a framework for anonymising data, said the metrics determining which data points to include in public sets are poorly defined.

“When wide data sets have been released historically, people sit around together and say, ‘Well what do we feel comfortable with releasing?’ – and that’s true if you’re a bank, or a credit card company, or a telco, or a government,” Dr Oppermann told Information Age.

“The problem is that everyone’s intuition, their ‘abdominal computer’, is different so there’s been no way of telling just how safe the data really is.”

To overcome that problem, Dr Oppermann and the ACS Data Sharing Committee have spent the last three years developing a quantifiable measure for data sharing that should reduce the need for people to rely on their ‘abdominal computers’ when opening data stores to the public.

Part of that work has been creating the ‘personal identification factor’: a number that helps measure the risk of an individual being re-identified from a dataset.

With a defined framework and empirical factor applied to datasets, the decisions around which data to include in a safe and informative way become more robust.
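The article does not give the ACS framework's exact formula for the personal identification factor. A common proxy in the privacy literature – shown here purely as an illustration, not as the ACS definition – is the worst-case risk 1/k, where k is the size of the smallest group of records sharing the same combination of published attributes:

```python
# Hypothetical risk proxy, not the ACS personal identification factor.
# Groups records by their quasi-identifiers (attributes that could be
# linked to outside information) and returns 1 / (smallest group size).
# A value of 1.0 means at least one record is unique, i.e. maximally
# exposed to re-identification.
from collections import Counter

def max_reid_risk(rows, quasi_identifiers):
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in rows
    )
    return 1 / min(groups.values())
```

Under this kind of measure, a release decision stops being a matter of intuition: the number either sits below the agreed threshold or it does not.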

Comparing the publicly accessible NSW COVID-19 case datasets over time shows how dynamic data can be adjusted on the fly.

“We've always got one table that goes out every single day – that's notification date and postcode,” Dr Oppermann said.

“But with the ‘likely source of transmission’ we only linked that with other data sets after a certain date when there were enough cases that we were not so worried about reidentification.”

In another instance, the ‘COVID-19 cases by age range’ dataset is incomplete – only ranging from early March to late April – because it increased the potential for re-identification (as measured by the personal identification factor) above acceptable levels.

One problem is that a dataset’s usefulness decreases as data points are removed.

Too few data points and it’s worth little for people who want to generate insights; too many and privacy starts to go out the window.

“With some data we’ve seen it is too sensitive and have had to dial back the level of personal information,” Dr Oppermann said.

“We’ll keep watching to make sure the risk level doesn’t creep up as rare cases come in – and if it does then we will apply extra protection to drop it back down to a level we’re comfortable with.

“But at least every day we’ve got a number that we’re looking at to say ‘it must be less than this’.”
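The daily check Dr Oppermann describes – a number that “must be less than this” – can be pictured as a loop that keeps applying extra treatment until the measured risk drops below an agreed ceiling. Everything in this sketch (the 1/k risk proxy, the ceiling value, and the coarsening step) is a hypothetical illustration, not the ACS method:

```python
# Hypothetical daily release check: compute a re-identification risk
# number for the day's data and, while it exceeds the agreed ceiling,
# apply a further round of treatment before publishing.
from collections import Counter

CEILING = 0.2  # illustrative maximum acceptable risk

def risk(rows, keys):
    # 1 / (smallest group sharing the same published attributes).
    groups = Counter(tuple(r[k] for k in keys) for r in rows)
    return 1 / min(groups.values())

def daily_release(rows, keys, coarsen):
    """coarsen: a treatment step, e.g. widening age bands or
    suppressing a column, applied until risk is acceptable."""
    while risk(rows, keys) > CEILING:
        rows = coarsen(rows)
    return rows
```

This also captures the “risk creep” the quote warns about: as rare cases arrive, the smallest group can shrink, the number rises, and the loop automatically demands more treatment.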

Making the most of a bad situation

The sudden onset of COVID-19 has hastened our reliance on data.

Since the outbreak reached our shores, state and federal governments have leaned on data gathered and shared by health officials, technology giants, and a controversial app to gauge the success of containment measures and help re-open parts of our community.

Dr Oppermann said the crisis also opened the door for wider collaboration around effective data governance.

“A lot of people were able to put their shoulders to the wheel and it was great,” he said.

“I wouldn’t wish the pandemic to last longer, but that sense of focus and urgency was really very useful for moving the conversation forward.”

Having conducted a successful live test of the data sharing framework, Dr Oppermann wants to see data use expanded to help improve society beyond pandemic responses.

“The ideal to me at the moment is mapping the infinite complexity of the world that we live in into outcomes frameworks: one around people, one around the economy, and one around the environment,” he said.

“There was a great opportunity to rapidly prototype the way we describe certain indicators and measure how, when levers are applied in a health response, for example, it affects levers in other domains.

“Being able to have an honest, open debate about where the dial should be set, how much is appropriate, and what’s the relative impact from one to the other embraces so much more complexity than we’ve ever been able to embrace.”

COVID-19 has been a crisis that shows how the different levers wielded by governments affect other aspects of the economy, our lives, and the environment.

Dr Oppermann is hopeful that the lessons learned from the pandemic will continue to be applied.

“As the pressure was applied to clamp down in order to drive the health response, we didn’t have an ability to see what’s happening in other outcomes associated with people,” he said.

“We want to have that understanding of things like mental health burden increase, or domestic family violence burden increase when lockdown policies are applied.

“Before, we didn’t have a way of seeing how, as we applied the policy of lockdown for health, that moves the dials of other human services outcomes, or economic outcomes, or environmental outcomes.”

Privacy, please

As governments embrace the complexity of a data-rich world, they need to be mindful of privacy and the risk that someone might use public data for nefarious purposes.

After the Department of Health released the medical billing records of around 2.9 million Australians to help with research, it was quickly determined that patients within the dataset could be re-identified using other publicly available information.

Clearly, there is a need to balance the usefulness of data for research and the privacy of people who have surrendered that information while going about their lives.

What the personal identification factor and data sharing framework do is strike a balance between the two: they hold that there is an empirically determinable point at which a dataset can maintain both utility and privacy.

“There is no problem that data will not help to better inform or better understand, provided the data quality is appropriate and the chain of governance and handling is appropriately understood,” Dr Oppermann said.

“If your data is 99 per cent pure, that’s great.

“If your data gives you 99 per cent coverage of the problem, you can make some pretty powerful insights.

“If your data is 99 per cent accurate, you can do some pretty powerful things.

“But as those different dimensions start to fall back, the quality starts to fall and what’s appropriate to do with the data starts to fall back pretty quickly.”