Microsoft unveils VALL-E, the AI that copies human voices: what it is all about

Trained on more than 60,000 hours of speech, VALL-E can reproduce human voices and their tone with impressive accuracy.

January 13, 2023

Elizabeth Smith

These are interesting times for artificial intelligence news. The latest innovation in the field is VALL-E, an AI system that reproduces human voices.

The artificial intelligence model, presented to the public by Microsoft, is capable not only of reproducing the words spoken by a person, but also of matching their tone according to their emotional state.

In this article, we will go over what is currently known about VALL-E, how it works, and the possible risks associated with its use.

What is VALL-E, the AI that reproduces human voices, and how does it work

The principle behind VALL-E is as simple as it is effective.

From just 3 seconds of speech recorded from a person, the artificial intelligence is able to reproduce that person’s voice. All it then takes is a text input: the system converts the text to speech in the same voice and tone, with a fidelity that is nothing short of amazing.

To achieve this, the system has been trained on more than 60,000 hours of speech in English, for now the only language for which VALL-E is available. With such a database at its disposal, the software has a solid foundation on which to process newly received audio input.

How accurate are the results of this AI? According to the research paper posted on arXiv, the preprint repository hosted by Cornell University, the reproduction is nothing short of amazing.

In short, according to the paper, VALL-E far surpasses current zero-shot TTS (text-to-speech) systems, until now considered the state of the art, in terms of naturalness and similarity to the original voice.

Although VALL-E makes the software used in this field so far look outdated, it is still not a perfect mechanism. In testing, some samples show minor problems.

Among the many audio clips released so far, some contain imperfections that reveal, at least to the most attentive ear, that the reproduced voice is not entirely natural.

That said, it is easy to see how VALL-E, if developed properly, could become the leading artificial intelligence when it comes to reproducing human voices.

VALL-E benefits and risks

Advanced artificial intelligence systems such as VALL-E or the now famous ChatGPT are surprising and interesting, but also potentially very dangerous.

Until now, it was already possible to edit footage into deepfake videos that were credible, to say the least. With this system it will be possible to make even more convincing montages, perhaps even capable of fooling Intel’s FakeCatcher.

It will then be possible to put words into people’s mouths that they never said, with obvious repercussions for fake news and propaganda of various kinds.

While these technological developments appear fascinating and entertaining, we should consider the downside.

The unbridled application of AI could create rather worrisome situations for public opinion and freedom of expression.