ChatGPT’s ability to write code has been getting worse over recent months, with the percentage of prompts that produce working code dropping sharply between March and June, a new study has found.
A team of researchers from Stanford and the University of California, Berkeley set out to test how the large language models (LLMs) that underpin ChatGPT – GPT-3.5 and GPT-4 – have changed over time.
The results, published on the open-access pre-print site arXiv, quantify a decline in ChatGPT’s quality that some of its users had already noticed.
For the paper’s section on code generation, the researchers took 50 ‘easy’ problems from the learning platform LeetCode and fed them to GPT-4 and GPT-3.5 in the form of prompts.
The models’ responses were then sent back to LeetCode for judgement. If a response passed, it was classified as ‘directly executable’.
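As a rough sketch of that grading loop – not the study’s actual harness, which relied on LeetCode’s own judge – the ‘directly executable’ criterion can be approximated locally with a syntax check on the raw response:

```python
def is_directly_executable(response: str) -> bool:
    """Approximate the study's 'directly executable' criterion.

    The real test submitted the unmodified response to LeetCode's judge;
    a local compile() check is a stand-in here, since the judge is not
    scriptable in a sketch. The response must run as-is, with no clean-up.
    """
    try:
        compile(response, "<llm-response>", "exec")
    except SyntaxError:
        return False
    return True


print(is_directly_executable("def add(a, b):\n    return a + b"))       # True
print(is_directly_executable("Here is the code:\ndef add(a, b): ..."))  # False
```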
When this test was run against the March 2023 version of GPT-4, more than half (52 per cent) of the generated responses were ‘directly executable’, but the June version’s code worked only 10 per cent of the time.
GPT-3.5 performed even worse, falling from 22 per cent correct in March to just two per cent with the June model.
As the models’ code got worse, their verbosity – the length of the generated response – increased.
The researchers hypothesise that these two features of their experimental results are linked, writing that the June versions “consistently added extra non-code text”, often in the form of comments, despite the prompt asking for “code only”.
In one instance, GPT-4 added erroneous quotation marks that broke its otherwise functional code blocks.
Those very small changes, the researchers point out, can be “particularly challenging to identify when LLM’s generated code is used inside a larger software pipeline”.
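The failure is easy to reproduce. In the sketch below – a hypothetical example, since the paper does not publish the exact broken outputs – markdown-style backtick fences and stray triple quotes stand in for the wrapper characters the June model emitted, and a pipeline defends itself by stripping any whole-response wrapper before execution:

```python
import re

# Wrappers of the kind the paper describes: backtick fences or stray triple
# quotes around an otherwise working snippet. The exact characters are an
# assumption for illustration.
WRAPPER = re.compile(r"^(?:```[a-zA-Z]*|''')\n(.*)\n(?:```|''')$", re.DOTALL)


def strip_wrapper(response: str) -> str:
    """Remove a whole-response code wrapper so the snippet runs as-is."""
    match = WRAPPER.match(response.strip())
    return match.group(1) if match else response


broken = "```python\ndef add(a, b):\n    return a + b\n```"
print(strip_wrapper(broken))  # prints the runnable inner function
```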
Other topics the researchers tested were ChatGPT’s ability to reason through maths problems, whether or not it answered sensitive questions, and its visual reasoning skills. Each showed a noticeable change over time.
Mathematical reasoning offered a surprise: the more advanced GPT-4 went from successfully reasoning through problems 97.6 per cent of the time in March down to just 2.4 per cent in June, while the success rate of its predecessor, GPT-3.5, moved sharply in the opposite direction.
The researchers concluded that their study “highlights the need to continuously evaluate and assess the behaviour of LLMs in production applications”.
“For users or companies who rely on LLM services as a component in their ongoing workflow, we recommend that they should implement similar monitoring analysis as we do here for their applications,” they wrote.
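A minimal version of that monitoring – assuming a query_llm wrapper around whichever service is in use and an application-specific passes check, both hypothetical names – is to replay a fixed prompt set on a schedule and log the pass rate for comparison across model releases:

```python
import datetime
from typing import Callable, Iterable


def run_drift_check(
    prompts: Iterable[str],
    query_llm: Callable[[str], str],     # hypothetical wrapper around the LLM service in use
    passes: Callable[[str, str], bool],  # application-specific check of (prompt, response)
) -> float:
    """Replay a fixed prompt set against the live model and log the pass rate.

    Comparing rates logged across days or weeks surfaces the kind of silent
    behaviour drift the study measured between March and June.
    """
    prompts = list(prompts)
    passed = sum(1 for p in prompts if passes(p, query_llm(p)))
    rate = passed / len(prompts)
    print(f"{datetime.date.today().isoformat()} pass rate: {rate:.1%}")
    return rate
```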