OpenAI has demonstrated GPT-4 omni (GPT-4o), an “astonishing” update that gives ChatGPT a human-sounding voice, translation skills, computer vision, an emotional range – and a singing voice.

Introduced in a live demonstration led by OpenAI chief technology officer Mira Murati, the user interface changes built into the GPT-4o large language model (LLM) – which she said “will bring GPT-4 level intelligence to everyone, including our free users” as it is rolled out in the coming weeks – have been designed to make interaction with the model “much more natural and far, far easier.”

“For the past couple of years, we’ve been very focused on improving the intelligence of [GPT] models but this is the first time that we are really making a huge step forward when it comes to ease of use,” Murati said. “This is incredibly important because we’re looking at the future of interaction between ourselves and the machines.”

Through a series of demonstrations, Murati – along with head of frontiers research Mark Chen and post-training team lead Barret Zoph – showed how the GPT-4o app, which is also set to debut as a desktop app, provides a natural language interface that supports queries in dozens of languages and delivers near-instantaneous responses.

The new model’s faster speed meant the demonstrators could interrupt the GPT-4o voice in mid-sentence, giving it new instructions just like one person might interrupt another during the natural flow of conversation.

Asked to read a bedtime story, GPT-4o changed its tone of voice when requested, speaking in a more intense way each time it was asked to add more “drama” – then switching to a dramatic robotic voice, and singing the end of the story as well.

The multi-modal model also integrates computer vision – allowing it to, for example, interpret a handwritten linear equation and talk Zoph through the process of solving it.

GPT-4o’s computer vision capabilities also enabled it to analyse a selfie of Zoph and infer his emotional state – “pretty happy and cheerful,” the model surmised, “with a big smile and maybe a touch of excitement”.

Once more, with feeling

The voice capabilities of GPT-4o immediately drew comparisons online to ‘Samantha’, the Scarlett Johansson-voiced AI companion from the 2013 movie ‘Her’ – which mainstreamed the idea of an emotional, human-sounding AI capable of convincing willing users that it was human.

The emotive range of the new AI is “quite astonishing,” Alex Jenkins, director of Curtin University’s WA Data Science Innovation Hub, told Information Age.

He likened the original ChatGPT to “a deaf person who read every book in the world, every journal article, and every piece of paper they could get their hands on – but they didn’t know how the world sounds.”

“They didn’t know what human speech was like,” he said, “and that obviously has an impact in terms of communicating in a human-like way, because we use expression in our voice all the time as a key component of communication.”

Although computers have been “talking” for many years, Jenkins added, previous “dumb” text-to-speech engines “didn’t understand the intent and context of the conversation. They were reading the words out and not applying inflection in any kind of meaningful way.”

“This new model understands how the world sounds and how people sound, and it’s able to express its voice in a similar fashion to what humans can do.”

The announcement quickly drew a counter-salvo from Google, which announced the availability of its Gemini 1.5 Pro LLM – adding features such as analysis of audio files and of uploaded documents up to 1,500 pages long.

Availability of GPT-4o as a desktop app will also threaten Apple’s Siri – reportedly due for an AI overhaul at next month’s Worldwide Developers Conference – and Microsoft’s Cortana voice assistants, with Zoph demonstrating how he could feed an application’s source code to the desktop app and ask it questions about that code – such as what it does or what its output means.
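
The demonstration used the ChatGPT desktop app directly, but the same kind of code question can also be posed through OpenAI’s API. The sketch below is illustrative only – the file name and prompt wording are hypothetical – and assumes the standard openai Python client and the publicly documented ‘gpt-4o’ model name:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical file standing in for "an application's source code"
with open("plot_weather.py") as f:
    source_code = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Here is the source code of a small application:\n\n"
            f"{source_code}\n\n"
            "In a short paragraph, explain what this code does "
            "and what its output will look like."
        ),
    }],
)

print(response.choices[0].message.content)
```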

Progressing the technology to this point “is quite complex,” Murati said, “because when we interact with each other there is a lot of stuff that we take for granted.”

Where previous GPT models produced speech by orchestrating three separate elements – transcription, intelligence, and text-to-speech – she explained that GPT-4o handles these capabilities natively across voice, text, and visual prompts.
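
To make the distinction concrete, the older approach can be approximated with three chained API calls – one model per stage – as in the rough sketch below. This illustrates the pipeline pattern Murati described, not OpenAI’s internal implementation; it assumes the standard openai Python client, a recorded question in ‘question.wav’, and publicly documented model names:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Transcription: turn the user's recorded speech into text
with open("question.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2. Intelligence: feed the transcribed text to the language model
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text-to-speech: read the model's written answer back aloud
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)

with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```

Each hand-off in that chain adds latency and loses the vocal expression Jenkins describes; folding the stages into a single natively multimodal model is what allows the near-instant, interruptible responses shown in the demonstrations.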

The efficiency of GPT-4o is also significant because it marks the first time that OpenAI’s GPT-4 class of model – a far more powerful engine than the widely used GPT-3.5, and one previously offered only to paying customers – is available to any user, for free.

With GPT-4 the benchmark against which other LLMs are measured in terms of capability, speed and security, its general availability will significantly boost the AI capabilities available to the mass market – with GPT-4o’s voice-driven user interface enabling a broad range of new use cases.

As well as supporting applications such as helping autistic people learn how to communicate verbally, the new model will likely be able to write poetry “that sounds like it flows, and sounds lyrical,” Jenkins said.

“We’re a long way from the sort of doomsday Skynet scenario,” he laughed.

“I think the biggest immediate risk is that we'll be inundated with a lot of mediocre poetry.”