An artificial intelligence project supported by a consortium of technology companies but staffed by dozens of unpaid volunteers is aiming to create what it says will be the nation’s first open-source large language model (LLM) built with Australian English and local culture in mind.

Kangaroo LLM, which is preparing to collect data from millions of Australian websites, hopes to become a nonprofit foundation spun off from its parent company Katonic AI, the firm’s founder and Kangaroo LLM steering committee chair Prem Naraindas told Information Age.

“It is going to be a nonprofit organisation focused on the mission to build large language models which are Australia-focused,” he said.

The project, which has been in the works for around six months, has been backed by five technology companies: Katonic AI, RackCorp, NEXTDC, Hitachi Vantara, and Hewlett Packard Enterprise.

Three more companies were expected to come on board soon, Naraindas said, with each contributing its own technology to the project and doing “most of the heavy lifting and hard work”.

But over the next six months, around 100 unpaid volunteers — at least 10 of them working full-time — would be needed to make the project a reality, said Naraindas, who called for more people to apply to contribute.

‘The idea is not to exploit’

Naraindas defended Kangaroo LLM’s call for unpaid volunteers and argued many people, including both experienced workers and recent graduates, were keen to work with the consortium.

“The idea is not to exploit anyone, right? The idea is to make it a win-win for all,” he said.

Job advertisements posted by Kangaroo LLM on Tuesday called for “enthusiastic volunteers” to apply for roles working remotely for “approximately 10-15 hours per week”.

An advertisement for a Bot Lead said suitable candidates would have experience in data engineering, web crawling, programming, and web technologies, as well as “knowledge of Australian privacy laws and data governance”.

A separate advertisement for a Dataset Lead said candidates would ideally have experience in machine learning (a field of AI), data science, linguistics, natural language processing, and large-scale dataset management.

Both advertisements suggested successful volunteers would “gain valuable experience” as well as “networking opportunities” and the chance to “contribute to a project with significant national impact”.


Kangaroo LLM has been in the works for months, but was officially announced in August. Photo: Kangaroo LLM / Supplied

Naraindas said on Wednesday that Kangaroo LLM had received more than 15 applications since posting the advertisements.

“I think by the second week of October we will have about 45 to 50 volunteers working and in the process of getting them on board,” he said.

He added that Kangaroo LLM was yet to hear any concerns from people about the roles being unpaid, but the firm was still figuring things out.

“Hopefully that won’t land us in trouble,” Naraindas said.

“… And of course, in the beginning everyone gets excited, but at some point in time people want to be compensated.

“So at least the core team around the chief program officer, the chief technology officer, the chief executive officer … will be compensated at some point of time when we start having some influx of money through this.”

Kangaroo LLM’s chief product officer (CPO) Sarina Hinduja, who has been tasked with overseeing the project’s LLM development following previous roles with the likes of Westpac, PwC, and IBM, was one of the current full-time volunteers, Naraindas said.

He added that CTO and CEO announcements were forthcoming, and the firm was also looking for a volunteer privacy and data law adviser.

Asked about resumes Kangaroo LLM had already received from prospective volunteers, Naraindas said applicants saw there was “potential” in the firm.

“A lot of people want to be part of this because it helps them — apart from their daily job, which might be monotonous — it helps them learn something new,” he said.

“… Eventually they'll support in bringing in these corporates and we'll have money flowing in as well.”

Naraindas said he hoped Kangaroo LLM would eventually have large corporate clients who paid annual fees to access its Australian-focused data set while developers, startups, and researchers could potentially access some of its tools for free.

“It requires a lot of money, and we definitely need to find that money,” he said.

“Of course there could be government grants later, but at this moment we have to take it off from somewhere.”

Australian website scraping delayed

While Kangaroo LLM previously announced it would start collecting data from more than 754,000 Australian websites on 25 September, Naraindas said that had been delayed following concerned comments on social media.

He said the group was building a team of legal advisers while its so-called Kangaroo Bot web crawler was on standby.

“We don't want to build and then apologise, right?

“We want to ensure that we are very transparent about what we are doing,” he said.

“Of course, there are laws, but a lot of these laws are not really easy to understand.”

Kangaroo LLM has described its Kangaroo Bot as an “ethical” web crawler which would honour websites’ wishes if they did not want their content used for AI training.

The bot would respect sites’ robots.txt files (which provide instructions to web crawlers), and personally identifiable information would then be removed from the collected data, Naraindas said.

He said websites could opt out of the data scraping by adding an instruction to their robots.txt file (the text can be found on the Kangaroo LLM website), or by emailing the group.

“If somebody wants to opt out, they can opt out — their data will be scrubbed away from the large language model on a monthly basis as well,” he said.
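For readers unfamiliar with how a robots.txt opt-out works in practice, the following is a minimal sketch in Python using the standard library's robots.txt parser. It is not Kangaroo LLM's code: the user-agent token `ExampleKangarooBot`, the sample rules, and the example URL are all placeholders assumed for illustration, and the actual opt-out text is the one published on the Kangaroo LLM website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent token, assumed for illustration only.
# The real token is the one published on the Kangaroo LLM website.
BOT_TOKEN = "ExampleKangarooBot"

# Sample robots.txt a site owner might publish: block the one crawler,
# leave every other crawler unaffected.
ROBOTS_TXT = """\
User-agent: ExampleKangarooBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler runs a check like this before requesting a page.
opted_out = may_fetch(ROBOTS_TXT, BOT_TOKEN, "https://example.com.au/news/")
other_bot = may_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com.au/news/")

print(opted_out)  # False: the site has opted out for this crawler
print(other_bot)  # True: other crawlers fall under the wildcard rule
```

Compliance with robots.txt is voluntary on the crawler's side, so a check like this only protects a site if the crawler actually performs it before each request.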

In total, around 4.2 million Australian websites had been identified for potential data collection, Naraindas said.

Kangaroo LLM previously said it would publicly release a list of sites it proposed to scrape, but this was yet to be published at the time of writing.

The Kangaroo Bot’s web crawling would allow for the creation of what Kangaroo LLM called its “VegeMighty Dataset”, the group said.

Kangaroo LLM said it aimed to then use its so-called “Great Barrier Reef Pipeline” to refine that data set “into high-quality content for Kangaroo LLM training”, focused on Australian culture.

Data would also be contributed by the companies which were part of the consortium, Naraindas said.

“We felt that there was a lack of a large language model which understands Australian context,” he said.

“… We want to ensure that the large language model which Australia has is aligned with our ethical and regulatory thinking as well.”

Naraindas said Kangaroo LLM was hoping to launch the first version of its AI model in late October or early November, but admitted “there could be a few weeks’ delays”.

Information Age contacted consortium members RackCorp, NEXTDC, Hitachi Vantara, and Hewlett Packard Enterprise for comment.