AI tools poured in throughout 2022, and new ones are still appearing, astounding people with their ability to create eerily convincing essays, artwork, and videos from nothing more than a text prompt. However, while AI that can generate text, images, and video has been in the spotlight for a while, speech hasn't received nearly as much attention.
Microsoft is changing that with Vall-E, a new AI text-to-speech (TTS) system that can imitate a person's voice from a three-second recording and use it to convert written words into speech. Of course, the concept isn't new; what's new is how frighteningly good the AI is at convincing us that the output is from a real human being - when it isn't. Let's take a look at how Vall-E works, its capabilities, and its applications.
But first, what exactly is Vall-E?
Microsoft describes Vall-E as a "neural codec language model." It takes a different approach than previous voice generators, which Microsoft says allows it to achieve far greater accuracy. One difference is scale: Microsoft claims the TTS training data was scaled up to 60,000 hours of English speech, which is hundreds of times larger than existing systems. This enables the TTS system to generate "high-quality personalised speech" using only a 3-second recording of any person as an "acoustic prompt."
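To give a rough sense of what treating speech as a language-modelling problem means, the sketch below counts how many discrete codec tokens a short clip would turn into once an audio codec has encoded it. The function name and the default values (75 codec frames per second, 8 codebooks per frame) are illustrative assumptions rather than figures from this article; it is a back-of-the-envelope calculation, not Microsoft's implementation.

```python
def prompt_token_count(seconds, frames_per_second=75, codebooks=8):
    """Estimate how many discrete codec tokens represent a clip.

    frames_per_second and codebooks are assumed, illustrative values:
    a neural audio codec turns each short frame of audio into a few
    integer codes, and a language model can then treat those codes
    much like words in a sentence.
    """
    frames = round(seconds * frames_per_second)
    return frames * codebooks


if __name__ == "__main__":
    # Token count for a 3-second acoustic prompt under the assumptions above.
    print(prompt_token_count(3.0))
```

Under these assumed settings, even a three-second prompt becomes a sequence of well over a thousand tokens, which is the kind of input a language model is built to continue.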
Despite the similar-sounding names, Vall-E appears to have nothing to do with Dall-E, OpenAI's deep learning model for generating images from natural language descriptions.
What makes Vall-E unique
The larger training data set mentioned above, along with other new methods, sets Vall-E apart from other TTS systems. According to Microsoft, this has enabled it to "significantly outperform" other products in its category in terms of speech naturalness and speaker similarity. Vall-E was also designed to deliver in "zero-shot situations," which means it doesn't need prior examples of a speaker or training in a specific context - just a 3-second audio clip and a text prompt.
Overview of how Vall-E works (Image: Microsoft)
Vall-E's ability to preserve the speaker's emotion is perhaps its coolest aspect. Microsoft has demonstrated this capability on the TTS system's GitHub page. The 3-second clip can be spoken in any tone - angry, sleepy, neutral, amused, disgusted, and so on - and Vall-E will recite any text while maintaining that tone.
Giving a voice back to people who have lost their ability to speak is one of the most obvious applications for this technology. Even extremely short recordings of a subject's voice can be used to reconstruct a very natural-sounding artificial voice. It can also be used by people who have difficulty speaking; they can type what they want to say, and Vall-E will convert it into speech.
Concerns
AI is rarely without concerns, and it is only natural that Vall-E would bring its own set of worries. In its system paper, Microsoft acknowledges these issues, stating that Vall-E has the potential for abuse, such as spoofing voice identification or impersonating a specific speaker. We've already seen deepfakes spread misinformation and cause confusion by fabricating false narratives about people, so it'll be interesting to see how things play out with Vall-E if and when it's made public.