Google releases 25 million data sets

After 16 months in beta, Google has officially launched an archiving service that it hopes will become the world standard for sharing data on just about anything.

The company’s Dataset Search site – a fully searchable, filterable index of millions of public and private data sets from around the world – includes over 25 million data sets and 6 million data tables provided by government agencies, not-for-profit organisations, scientific research bodies, community groups, industrial scientists, and more.

The site, which can be searched and filtered like a normal Google search, is a cornucopia of data that includes, for example, information about Australia’s favourite hobbies, which dogs are the smartest, a list of Cricket World Cup winners, and the suburbs that eat the most KFC.

The most frequently contributed data sets include scientific data – specifically, those built around geosciences, biology, and agriculture – but you’re likely to find information on just about anything you’re interested in as organisations chip in with data sets containing scientific research, consumer surveys, historical records, sports statistics, weather observations and more.

“Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others’ work, and providing data journalists easier access to information and its provenance,” Google Research scientist Natasha Noy and co-developers Matthew Burgess and Dan Brickley noted in a recent paper about the project.

“The approach relies on an open ecosystem, where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this metadata, providing a search engine that lets users find datasets in the ‘long tail’ of the Web.”

The Google Research division where Noy works has contributed 81 datasets of its own, including images, cartoons, drawings, audio snippets, text documents, and even shaky handheld bike-riding videos that it uses to train its own artificial-intelligence, image enhancement, language translation, and other products.

So far, early users have been using the site for more pedestrian purposes: the most common queries, Google says, include data sets related to education, weather, cancer, crime, soccer, and dogs.

Indexing the world’s data

Gathering large numbers of data sets together has been a longtime goal for organisations struggling to stay on top of data volumes that are growing quickly: current projections suggest that by 2025 the world will create 463 exabytes (463 million terabytes) of new information every day.

Finding and using that data is a considerable challenge, and individual efforts to organise small corners of the Internet have created thousands of individual repositories that aren’t interconnected.

The National Library of Australia (NLA), for one, has worked with over 1,000 organisations around the country to archive some 271 million historical Australian cultural documents, as well as 6.2 billion Australian Web sites, within its Trove collection.

Dataset Search can link these and myriad other collections – as the NLA has already done by connecting several data sets including its sheet music collection, FOI Disclosure Log, Trove people and organisations data, and Picture Australia metadata.

To be listed on the site, however, organisations must label their data using an industry-standard metadata structure defined by industry group Schema.org.

The Dataset description from Schema.org – supported by Microsoft, Yahoo and Yandex and managed by Brickley – outlines a range of tags that organisational websites can use to describe their data, its accessibility features, its inclusion of audio or music, copyright controls, appropriate age range, and more.
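As a rough illustration of what that labelling can look like, the Python sketch below builds a Schema.org Dataset description as JSON-LD and wraps it in a script tag that a publisher could embed in a web page. The property names (name, description, url, license, keywords, distribution) come from the Schema.org vocabulary; the helper functions, the example.org addresses and the sample values are hypothetical, and the exact fields Dataset Search expects are not spelled out here.

```python
import json


def build_dataset_jsonld(name, description, url, license_url,
                         keywords, download_url, encoding_format):
    """Return a dict describing one data set with Schema.org's Dataset type.

    Hypothetical helper for illustration; only the property names are
    taken from the Schema.org vocabulary.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "keywords": list(keywords),
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def to_script_tag(metadata):
    """Wrap the metadata in the script tag used to embed JSON-LD in HTML."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Placeholder values only; none of these describe a real data set.
example = build_dataset_jsonld(
    name="Example rainfall observations",
    description="Daily rainfall totals from a hypothetical weather station.",
    url="https://example.org/datasets/rainfall",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["weather", "rainfall"],
    download_url="https://example.org/datasets/rainfall.csv",
    encoding_format="text/csv",
)

if __name__ == "__main__":
    print(to_script_tag(example))
```

Keeping the description as a plain dictionary until the final step means the same record can be validated, stored or exported before it is published on a page.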

Yet Trove’s own archive describes its data using other metadata standards, such as unqualified Dublin Core (DC) records, EAC-CPF and RIF-CS – which means making that data available within Dataset Search could require a major technical effort to convert those records to the Schema.org standard.
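To give a sense of what such a conversion involves, the Python sketch below maps a handful of unqualified Dublin Core elements onto Schema.org Dataset properties. The mapping table is an assumption made for illustration, not a published crosswalk or the method used by the NLA or Google, and the sample record is invented. Elements with no entry in the table are returned separately so they can be reviewed by hand.

```python
# Illustrative, assumed mapping from unqualified Dublin Core elements to
# Schema.org Dataset properties. A real crosswalk would need more care.
DC_TO_SCHEMA = {
    "title": "name",
    "description": "description",
    "creator": "creator",
    "publisher": "publisher",
    "subject": "keywords",
    "identifier": "identifier",
    "date": "datePublished",
}


def dc_to_schema_dataset(dc_record):
    """Convert a Dublin Core record into a Schema.org Dataset dict.

    dc_record maps element names to lists of values, because Dublin Core
    elements are repeatable. Returns (dataset, unmapped), where unmapped
    holds the elements this sketch has no mapping for.
    """
    dataset = {"@context": "https://schema.org/", "@type": "Dataset"}
    unmapped = {}
    for element, values in dc_record.items():
        prop = DC_TO_SCHEMA.get(element)
        if prop is None:
            unmapped[element] = list(values)
            continue
        # Keep a single value as a scalar and repeated values as a list.
        dataset[prop] = values[0] if len(values) == 1 else list(values)
    return dataset, unmapped


# Invented sample record, not taken from Trove.
sample = {
    "title": ["Sheet music collection"],
    "subject": ["music", "scores"],
    "coverage": ["Australia"],
}
converted, needs_review = dc_to_schema_dataset(sample)

if __name__ == "__main__":
    print(converted)
    print(needs_review)
```

Even in this small case, one element (coverage) falls outside the mapping, which hints at why converting an archive of this size could be a substantial job.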

The promise of greater access to its massive archive of statistics has been enough to convince Statista, a US-based statistics aggregator whose own index follows data about more than 80,000 topics from over 22,500 sources, to take part.

Statista has linked large volumes of its data to Dataset Search, although users still need to subscribe to get the actual data.

The US government is the biggest single contributor to the Google effort, having linked more than 2 million data sets as part of open-data efforts that are also driving transformation within governments in Australia, the UK and elsewhere.