An exploratory trial of an artificial intelligence system by Australia’s financial services regulator earlier this year found the technology performed worse than humans, newly released documents show.

The proof-of-concept testing for the Australian Securities and Investments Commission (ASIC) was carried out by American cloud computing giant Amazon Web Services (AWS) over a five-week period between January and February 2024.

The trial used Meta’s Llama2-70B model to summarise public submissions made to a parliamentary inquiry; its summaries were then compared to summaries written by humans.

The large language model (LLM) was found to have “performed lower on all criteria compared to the human summaries” and could potentially have created more work for staff if deployed in the regulator’s operations, the agency said.

“The findings support the view that [generative] AI should be positioned as a tool to augment and not replace human tasks,” ASIC wrote in a response to the Senate Committee on Adopting Artificial Intelligence on 5 July, which was made public on Tuesday.

Appearing before the committee on 21 May, ASIC Chair Joe Longo said the AI had given “a bland summary” of the submissions it was asked to examine.

“It wasn't misleading, but it was bland,” he said.

“It really didn't capture what the submissions were saying, while the human was able to extract nuances and substance.”

However, in a draft AWS report also released on Tuesday, the company stated the AI had provided misleading information in some of its summaries.

In response to a question about the trial from Greens Senator David Shoebridge, ASIC said its tests had sought to “understand the future potential use of genAI” and confirmed the technology “was not used for ASIC’s regulatory or operational purposes”.

ASIC argued the LLM’s performance was likely hindered because testers spent only one week optimising the model and its prompts for what was a highly specific use case.

AWS, which has its own competing AI foundation model named Titan, declined to comment when contacted by Information Age.

AI ‘could potentially create more work’

In its draft report, AWS noted that assessors from ASIC “generally agreed that AI outputs could potentially create more work if used (in current state), due to the need to fact check outputs, or because the original source material actually presented information better”.

Five ASIC employees assessed summaries generated by both humans and the LLM but were not told that some of them were generated by an AI, the company said.

The human summaries scored 61 points out of a maximum of 75 (81 per cent), while the AI scored only 35 points (47 per cent).


Amazon Web Services (AWS) carried out the trial using Meta’s Llama2-70B model. Photo: Shutterstock

Three of the five ASIC assessors said they “suspected this was an AI trial” once they were told the technology had been used, according to the report.

The assessors found the AI’s output was at times “difficult to use”, “wordy and pointless”, included incorrect information (often called hallucinations), missed key points, used irrelevant information, and “made strange choices about what to highlight”.

The LLM was also found to have struggled to summarise complex information which required “a deep understanding of context, subtle nuances, or implicit meaning”.

“The finding emphasises the importance of a critical human eye which can ‘read between the lines’ and not take information at face value,” AWS said in its report.

The company noted that other promising LLMs were released during the trial.

“The results do not necessarily reflect how other models may perform,” it said.

ASIC said it had taken “a range of learnings” from the trial, including “the value of robust experimentation”, the importance of prompt engineering, and the need to evaluate both AI’s uses and shortcomings.

“Technology is advancing in this area and it is likely that future models will improve performance and accuracy of results,” the agency said.


An example of prompt engineering used during the trial to improve the chatbot's responses. Image: AWS 'Generative AI Document Summarisation Proof Of Concept'
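
The prompt itself survives only as an image in the published report. Purely as an illustration of the technique, and not the trial’s actual prompt, a structured summarisation prompt for an open-weights model such as Llama 2 might be assembled along these lines in Python (the criteria, word limit and sample text here are hypothetical):

# Illustrative sketch only: a structured summarisation prompt of the general
# kind prompt engineering produces. The criteria, word limit and sample text
# below are assumptions for illustration, not details from the ASIC/AWS trial.

submission = (
    "Our organisation broadly supports the proposed reforms, subject to a "
    "transition period of at least 18 months for smaller licensees."
)  # stand-in text; real inquiry submissions ran to many pages

prompt = f"""You are summarising a public submission to a parliamentary inquiry.

Write a summary of no more than 300 words that:
- states the submitter's overall position on the inquiry's questions
- lists any specific recommendations the submission makes
- preserves qualifications and nuance rather than flattening them

Submission:
{submission}

Summary:"""

print(prompt)  # this string would then be sent to the model, e.g. Llama2-70B

Iterating on instructions of this kind, adding criteria, constraints and examples, is the sort of optimisation ASIC said was limited to a single week during the trial.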

Senator Shoebridge, who had asked ASIC to share its report from the AWS trial, said that while it was “hardly surprising” humans sometimes performed better than AI, the technology needed to be tested in a transparent way which supported the work of humans.

"It's good to see government departments undertaking considered exercises like this for AI use but it would be better if it was then proactively and routinely disclosed rather than needing to be requested in Senate committee hearings,” he said in a statement.

Automation vs augmentation

Dr Emmanuelle Walkowiak, an economist at RMIT University whose research covers the intersection between technology and work, agreed it was good to see the government experimenting with AI, but argued automating tasks usually done by humans revealed the technology’s risks.

“I think it shows that you can’t allocate all tasks to an LLM, because for some tasks an LLM will produce some errors, or will produce an output with a lower quality,” she told Information Age.

Walkowiak said while this particular government trial did not assess how generative AI could help to augment workers’ tasks and improve their productivity, it was worth examining that in the future as models improved.

“If you use generative AI to increase the productivity of workers, that’s where you have this potential of creation of new jobs, new tasks,” she said.

“Because what will happen in that situation is that workers will focus on their area of expertise.”

Walkowiak said the trajectory of early-stage technology adoption was “really important”, as her research had found organisations were often reluctant to reverse technological changes once they had been implemented.

She said if governments did not regulate AI with its risks in mind, they might see productivity gains but could also increase the likelihood of problems arising from the technology.

"You really need to think about both [productivity and risks] together to find the right balance and the right policy support, to make AI complement the work of workers while mitigating these AI risks,” she said.

“I do believe that it's a critical time right now to act, for businesses, for governments, for regulation — and it will shape the decisions of employers for sure.”