A researcher has uncovered a gigantic and meticulously organised collection of 8.73 billion leaked Chinese records, constituting one of the world’s largest data leaks to date.

Found by Ukrainian security researcher Bob Diachenko, the leaked dataset contained not only full names, gender details, mobile numbers and home addresses, but also sensitive details such as national ID numbers and plaintext passwords.

Dates and places of birth, social media identifiers (such as usernames) and private email addresses also appeared in the dataset, increasing the risk of identity theft and potential account takeovers for countless China-based individuals.

Discovered on 1 Jan, the exposed database remained publicly accessible for more than three weeks before being closed.

Although Diachenko did not identify any signs of misuse by malicious actors, researchers at media outlet Cybernews said there was “ample time” for potential attackers to scrape the data.

“If our researchers managed to find it, there’s no reason others couldn’t too,” wrote Cybernews.

Jamieson O’Reilly, founder of information security company Dvuln, told Information Age there are “automated scanners” that constantly sweep the internet for “exactly this type of exposed infrastructure”.

“The real concern is the combination of data types present,” said O’Reilly.

“National ID numbers paired with plaintext passwords and contact details is very sticky data and highly valuable for executing targeted phishing campaigns.

“Even if some of this data has appeared in prior breaches, the value of a pre-organised, searchable compilation like this is significant to threat actors and it dramatically lowers the effort required to cross-reference and weaponise the information.”

Billions of records, organised by theme

The data was discovered on an unsecured Elasticsearch cluster. Elasticsearch is a popular data storage solution often used by legitimate organisations for its scalability and fast data-searching capabilities.

Information Age understands the exposed records were distributed across this platform in distinct, thematically arranged units of storage.

“The dataset structure and scale suggest intentional aggregation, not accidental logging or misconfiguration by a single consumer service,” said Cybernews researchers.

Each of these storage indexes was highly organised, while the types of data within largely matched those typically collected by data brokers for the purpose of resale.

Notably, the dataset also included various business records, such as company registration details and legal representatives, as well as aggregated “government-style identifiers”.

Diachenko did not determine precisely how many individuals appeared in the dataset.

Although the dataset contained some duplicate data, researchers believed hundreds of millions of unique individuals could have been affected.

No single data breach

Researchers noted that the presence of timestamps and import dates pointed to a “long-running aggregation effort rather than a single historical breach”, while the data itself was imported as recently as late 2025.

Speaking with Information Age, O’Reilly said the structure of the dataset indicated it was potentially “drawn from numerous prior breaches and leaks”.

“The researchers described 163 indices, highly organised and segmented by data type: phone-centric, ID-centric, account-centric collections,” said O’Reilly.

“That level of curation, combined with the presence of national ID numbers alongside plaintext passwords, social media identifiers, and business records all in one place, strongly suggests aggregation from multiple sources over time.”

And while O’Reilly suggested some of the data may trace back to legitimate platforms given its scale and variety, he was confident that “whoever assembled this particular cluster was almost certainly operating outside any lawful framework”.

Who leaked the data?

Diachenko did not attribute his discovery to any particular owner, and Cybernews researchers could not uncover a public claim of ownership for the 8.73 billion records.

Notably, however, the infrastructure was provided by a bulletproof hosting provider, a type of takedown-resistant supplier historically favoured by cybercriminals, including those behind Australia’s Medibank data breach.

“In most cases, that’s a deliberate choice of hosting environment,” said O’Reilly.

“A legitimate organisation running an Elasticsearch cluster at this scale would typically be on mainstream cloud infrastructure.

“Bulletproof hosting reinforces that this was someone who knew they were operating in a grey area at best.”

Further, other services hosted on the dataset’s server indicated the records could have been abused for financial fraud.

“The context here points more toward someone who aggregated this data intentionally and either didn’t care about securing it or lacked the sophistication to do so properly, rather than a big-name company accidentally leaving a production cluster open”, said O’Reilly.

In 2024, 26 billion leaked records appeared in a dataset described as the ‘mother of all breaches’, while a smaller batch of 184 million records in 2025 was found to contain user logins for the likes of Google, Facebook and Australia’s Department of Home Affairs.