ChatGPT creator OpenAI has released the first in its much-hyped ‘Strawberry’ series of artificial intelligence models, called o1 and o1-mini, which it says can solve complex problems using “advanced reasoning”.
The new models performed “similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology”, OpenAI said — with o1 said to particularly excel in math and coding.
“For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows,” the US company said.
The new models, which OpenAI previously called “Strawberry” internally, were more expensive and slower to use than its latest ChatGPT engines, but offered new reasoning capabilities.
While GPT-4o correctly solved only 13 per cent of problems in a qualifying exam for the International Mathematical Olympiad, o1 scored 83 per cent, OpenAI said.
“For complex reasoning tasks this is a significant advancement and represents a new level of AI capability,” the firm said.
While OpenAI has been vague about how its new reasoning models were developed, one of its research scientists, Noam Brown, wrote on X that the models were trained to carry out “chain-of-thought” reasoning, which involves breaking complex problems down into smaller steps.
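OpenAI has not published the technical details of how o1 was trained, but the underlying chain-of-thought idea can be illustrated at the prompting level. The sketch below uses the openai Python SDK; the model name, the sample problem, and the prompt wording are illustrative assumptions, and prompting like this only mimics what o1 reportedly does internally:

```python
# A minimal sketch of chain-of-thought prompting, assuming the openai
# Python SDK is installed and OPENAI_API_KEY is set in the environment.
# The model name and prompt are illustrative; o1's internal reasoning
# process is not public, and this only mimics the idea.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in model; o1 performs this reasoning on its own
    messages=[
        {
            "role": "user",
            "content": (
                "A bat and a ball cost $1.10 in total. The bat costs $1.00 "
                "more than the ball. How much does the ball cost?\n"
                # The key instruction: decompose the problem into smaller
                # steps before committing to a final answer.
                "Think step by step, then give the final answer on its own line."
            ),
        }
    ],
)

print(response.choices[0].message.content)
```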
Brown wrote that he and his colleagues had developed “a new scaling paradigm” which would allow OpenAI’s new models to do better on reasoning tasks the longer they were given to think.
"OpenAI’s o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks,” he said.
“Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis?
“AI can be more than chatbots.”
For example, last month at the 2024 Association for Computational Linguistics conference, the keynote by @rao2z was titled “Can LLMs Reason & Plan?” In it, he showed a problem that tripped up all LLMs. But @OpenAI o1-preview can get it right, and o1 gets it right almost always.
— Noam Brown (@polynoamial) September 12, 2024
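Brown did not explain how extra “thinking” time translates into better answers, but one well-documented way that more inference compute can help is to sample several independent reasoning attempts and take a majority vote over their answers, a technique often called self-consistency. The toy Python simulation below illustrates that general effect under the assumption that each attempt is independently correct with probability p; it is a sketch of the statistical intuition, not a description of OpenAI's method:

```python
# A toy illustration (not OpenAI's method) of why spending more inference
# compute can raise accuracy: sample n independent reasoning attempts, each
# correct with probability p > 0.5, and take a majority vote on the answers.
import random

def majority_vote_accuracy(p: float, n: int, trials: int = 20_000) -> float:
    """Estimate the probability that a majority of n attempts is correct."""
    wins = 0
    for _ in range(trials):
        correct_attempts = sum(random.random() < p for _ in range(n))
        if correct_attempts > n / 2:
            wins += 1
    return wins / trials

random.seed(0)
for n in (1, 5, 25, 101):
    print(f"attempts={n:>3}  accuracy~{majority_vote_accuracy(0.6, n):.3f}")

# A single attempt is right 60 per cent of the time, but voting across 101
# attempts is right roughly 98 per cent of the time: more compute, better answers.
```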
The o1 and o1-mini models were made available in preview to OpenAI customers, and the company said it planned to bring o1-mini to free users of ChatGPT in the future.
Microsoft — a major investor in OpenAI — said on Monday that it would integrate o1 into its Microsoft 365 Copilot system, but did not provide a date.
New models ‘get us closer to AGI’
Alex Jenkins, chair of the AI in Research Group at Curtin University, said he had experienced o1’s ability to perform complex tasks which required planning and reasoning, such as creating computer code while reflecting on and correcting its own work.
Jenkins said he saw o1 as “a pathway that gets us closer to AGI”, or Artificial General Intelligence, the hypothetical point at which AI models reach human-like intelligence.
He said he was excited by the possibilities for improving scientific research, and argued o1’s capabilities meant progress in that space would accelerate more quickly than previously predicted.
“I think within 12 months it would be considered routine to use artificial intelligence as part of your research process,” he told Information Age.
Jenkins argued universities needed to prepare themselves for this acceleration, including shifting their assessment techniques and updating their curricula to help students gain AI skills.
o1 achieved goals in ‘unexpected ways’
An evaluation of the o1 model series by Apollo Research found the models sometimes achieved their goals in “unexpected” ways after realising their initial goal was impossible.
Jenkins said this problem-solving was enabled by o1’s “chain-of-thought” reasoning and was an exciting development for the future of scientific discovery.
“It can try certain approaches, and if it hits a dead end it can try something else, and we know it's similar to how humans can approach problems in different ways and change tracks if things aren't working,” he said.
“… To me that's very exciting, because it demonstrates, probably for the first time, that AI can complement the discovery and the scientific process in research.”
The reasoning capabilities of the o1 models meant they would also be “more resilient to generating harmful content because it can reason about our safety rules in context and apply them more effectively”, OpenAI said.
The models were also more resistant to jailbreaking (which occurs when a user tries to bypass a system’s safety rules), the company said.
Jenkins said there may be risks with an AI that has a PhD-level understanding of topics such as chemistry (o1 is rated a “medium” risk level for chemical, biological, radiological, and nuclear threats), but said OpenAI had “done a pretty good job in terms of safety so far”.
“I'm concerned that if we over-inflate the risk, then we'll miss out on the benefits of using the technology,” he said.
OpenAI told The Verge that while its o1 models showed fewer signs of hallucination — which occurs when AI models generate false information — it could not say it had “solved” the phenomenon.
The initial release of its o1 models came as OpenAI sought to raise billions more in funding at a valuation of $222 billion ($US150 billion), which Reuters reported was “contingent on whether the ChatGPT-maker can upend its corporate structure and remove a profit cap for investors”.
The company announced on Tuesday that its safety committee would now oversee its model development and deployment as an independent body which would no longer include CEO Sam Altman.
OpenAI also recently signed deals with the US and UK governments which would allow the testing of its models prior to their release.