A newly developed text-to-speech language model from Microsoft can mimic a person’s speech from as little as three seconds of sample audio – but the tech giant is keeping the technology under wraps for now so it is not misused.
One of Microsoft’s artificial intelligence (AI) research teams published a paper about its model for text-to-speech (TTS) synthesis last week.
The model is called VALL-E – no doubt a nod to OpenAI’s image-generating AI DALL-E – and it shows a remarkable ability to copy not just a speaker’s voice, but also their emotional intonation (like anger) and the acoustic properties of the recording (like reverberation).
It was trained on 60,000 hours of audio from 7,000 unique speakers, using 16 high-end NVIDIA graphics cards with 32GB of memory each.
The result is a text-to-speech model that can produce audio mimicking speakers who didn’t appear in the training data, using only a small, three-second sample of their speech.
In the field of machine learning, this is known as a ‘zero-shot’ problem, and solving it gives the model far greater scope than an AI that needs to be trained on hours’ worth of a single person’s speech to accurately mimic them.
You can listen to samples of VALL-E on a site set up by the research team. Some of the results do sound computer-generated, and others don’t differ much from the baseline sample – limitations the paper acknowledges.
But overall, VALL-E’s speech mimicry and ability to reproduce audio environments is impressive to hear – so much so that the Microsoft team is preparing a method of detecting whether audio has been generated by VALL-E.
“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the paper says.
“To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesised by VALL-E.
“We will also put Microsoft AI Principles into practice when further developing the models.”
One possible practical application of this technology – beyond impersonating celebrities – is in creating audio data for speech recognition systems like Siri and Alexa.
“Speech recognition always benefits from diverse inputs with different speakers and acoustic environments, which cannot be met by the previous [text-to-speech] systems,” the paper’s authors note.
“Considering the diversity feature of VALL-E, it is an ideal candidate to generate pseudo-data for speech recognition.”
Microsoft also suggested a fully AI-based content creation system combining VALL-E with text-generation models like GPT-3, the model family behind ChatGPT. The podcasts and audiobooks of the future could be entirely written and spoken by generative AI.
Already, the creative industries are struggling to deal with the rise of AI art, which threatens to take over work that used to belong to human artists.
Digital artist Ben Moran made news recently after they were banned from the popular Art subreddit for submitting AI-generated art, which is against the community guidelines.
Moran is adamant they genuinely created the image, not an AI.