Amazon’s AI scientists are teaching the Alexa voice assistant to speak in the voice of any human – even someone who has passed away – after training it on just a few short audio clips.
Demonstrated for the first time at Amazon’s recent re:MARS 2022 event, the new feature is shown allowing a young boy to ask Alexa, “Can Grandma finish reading me The Wizard of Oz?”
The voice assistant then obliges, synthesising the deceased woman’s voice and reading the text aloud as the boy follows along in the book.
The technology is still in development but Rohit Prasad, Amazon senior vice president and head scientist for Alexa AI, positioned it as a way of adding personality and warmth to the generic voices of today’s AI voice assistants.
“One thing that has surprised me most about Alexa is the companionship relationship we have with it,” Prasad explained, noting that “in this companionship role, human attributes of empathy and affect are key for building trust.”
“These attributes have become even more important in these times of the ongoing pandemic,” he continued, “when so many of us have lost someone we love.”
“While AI can’t eliminate that pain of loss, it can definitely make their memories last.”
Building the technology required stepping back from conventional text-to-speech (TTS) engines, which let voice assistants speak fluidly using voices trained on many hours of recordings by studio voice actors.
Instead, Prasad explained, the engineers approached the problem as a voice conversion task, analysing the prosody of the target voice – the non-linguistic aspects of the way we speak – to feed a personal voice filter that lets Alexa speak in the target’s voice rather than its own.
“This required intervention where we had to learn to produce a high-quality voice using less than a minute of audio versus hours of audio in the studio,” Prasad said.
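Amazon has not published details of its pipeline, but the general idea – summarising prosody from a short reference clip and using it to adapt a generic synthetic voice – can be illustrated with a deliberately simplified sketch. The feature choices below (frame energy and zero-crossing rate as a crude pitch proxy) and the energy-matching “voice filter” are toy stand-ins for illustration only, not Amazon’s actual method:

```python
import numpy as np

def prosody_features(wave, frame=400):
    """Toy prosody summary: per-frame RMS energy plus zero-crossing
    rate as a crude pitch proxy. Real systems use far richer features."""
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)
    energy = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return {"energy_mean": float(energy.mean()),
            "zcr_mean": float(zcr.mean())}

def apply_energy_profile(wave, target_energy, frame=400):
    """Toy 'personal voice filter': rescale a generic waveform so its
    average frame energy matches the reference speaker's."""
    current = prosody_features(wave, frame=frame)["energy_mean"]
    if current == 0:
        return wave
    return wave * (target_energy / current)

# A one-second sine tone stands in for the short reference recording
sr = 16000
t = np.arange(sr) / sr
reference = 0.5 * np.sin(2 * np.pi * 220 * t)
target = prosody_features(reference)

# A quieter "generic assistant voice", adapted toward the reference
generic = 0.1 * np.sin(2 * np.pi * 180 * t)
adapted = apply_energy_profile(generic, target["energy_mean"])
```

Because RMS energy scales linearly with amplitude, the adapted waveform ends up with the same average frame energy as the reference clip – a one-dimensional caricature of what a real few-shot voice model learns from under a minute of audio.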
Your voice is their password
Amazon may be positioning the voice-mimicry technology as a sentimental feature that makes AI-driven assistants more human-like, but it is sure to find favour with criminals, who have already been experimenting with voice deepfakes to perpetrate major fraud.
In 2019, for example, a British CEO was tricked into sending more than $330,000 (US$243,000) to a scammer who used AI technology to emulate the voice of the chief executive of his firm’s German parent company.
Such tactics are only likely to become more common over time as better voice mimicking technology hits the mainstream.
Using the techniques Prasad described, malicious actors could craft a convincing synthetic voice of a company executive, political dignitary or celebrity simply by training Alexa on part of a speech given at an annual general meeting, business function or other event.
The device could then be manipulated to speak all manner of convincing statements, which could be daisy-chained to facilitate fraud at a whole new level.
Ever-improving technology means such problems aren’t far off, with companies like Aflorithmic pairing ‘synthetic voice cloning’ with increasingly convincing visual deepfakes to produce synthetic humans that can – as in the case of the ‘Digital Dom’ synth launched last year – emulate real people with uncanny accuracy.
This could have implications in the metaverse, where fake voices could ultimately be ported into new environments to allow fraudsters to pretend to be almost anybody.
Ultimately, deepfake researcher and BT Young Scientist & Technologist of the Year winner Greg Tarr told Cybercrime Magazine, audio and video deepfakes are getting so good that online denizens will simply need to remain sceptical of anything they hear and can’t verify in the real world.
“As these technologies become more and more available to the public and you don’t need technical experience to make convincing fake people, it will level out to the point where we cannot detect deepfakes – and that is something we’re going to have to get used to,” Tarr said.
“At that point,” he said, “we’re going to need to mature as a society rather than getting the technology mature, because there is a limit to that; we need to be less reliant on the information that we consume from unreliable sources.”