Microsoft’s TTS Tech Set to Transform Audio Experiences

In fields such as healthcare and education, text-to-speech (TTS) AI has simplified routine work and made multitasking possible, whether at home or at work.

Imagine speech bots assessing COVID-19 patients with minimal in-person contact, easing the workload on doctors. Consider, too, the cases where TTS is an enabler, such as when it helps people with disabilities or makes reading easier.

The best-known example is Stephen Hawking, who used computer software to speak through a synthesized voice. Thanks to this technology, many people can still listen to the late physicist’s voice.

TTS is an assistive technology that reads the text on a user’s screen aloud on a computer or tablet. As a result, it is popular with children who have reading difficulties, especially those who struggle with decoding.

TTS converts text into sound on a computer or other digital device. Children who struggle with reading can benefit greatly from it, and it can also help them with writing, editing, and even paying attention.

It gives a voice to digital content of every type (applications, websites, ebooks, online documents). Moreover, TTS systems offer a smooth way to have text read aloud on desktops and mobile devices.

Since they provide readers with a high level of convenience for both personal and business purposes, these solutions are becoming more and more popular. Microsoft recently created a brand-new TTS approach.

Microsoft created VALL-E, a neural codec language model. The AI first tokenizes speech, then generates waveforms that mimic the speaker while preserving the speaker’s timbre and emotional tone.
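
The idea of tokenizing speech can be illustrated with residual vector quantization, a technique neural audio codecs commonly use to turn each audio frame into a handful of discrete codes. The sketch below is a toy under stated assumptions, not VALL-E’s actual codec: the codebooks are random rather than learned, the frame is a made-up three-number vector, and the names make_codebooks, rvq_encode, and rvq_decode are hypothetical.

```python
import random


def make_codebooks(rng, n_books, size, dim):
    """Build toy codebooks. Assumption: random entries; a real codec learns them."""
    books = []
    for level in range(n_books):
        scale = 1.0 / (2 ** level)  # later codebooks cover finer detail
        # Entry 0 is the zero vector, so an extra stage can never increase the error.
        entries = [[0.0] * dim]
        entries += [[rng.uniform(-scale, scale) for _ in range(dim)]
                    for _ in range(size - 1)]
        books.append(entries)
    return books


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def rvq_encode(frame, codebooks):
    """Turn one frame into one integer code per codebook."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # The next codebook only has to explain what is still left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Rebuild an approximate frame by summing the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
codebooks = make_codebooks(rng, n_books=4, size=16, dim=3)
frame = [0.3, -0.2, 0.5]              # made-up feature vector for one audio frame
codes = rvq_encode(frame, codebooks)  # four integers, one per codebook
approx = rvq_decode(codes, codebooks)
```

Each frame becomes a short list of integers, and a language model can then treat those integers much like word tokens. Using more codebooks trades a longer token sequence for a closer reconstruction.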

The research paper states that VALL-E can produce high-quality, personalized speech using only a three-second enrolled recording of an unseen speaker as the acoustic prompt.

The method achieves this without additional structural engineering, pre-designed acoustic features, or fine-tuning. It also brings in-context learning and prompt-based approaches to zero-shot TTS.

Current TTS techniques fall into two categories: cascaded and end-to-end. Cascaded TTS systems were created in 2018 by researchers from Google and the University of California, Berkeley. These systems typically use a pipeline made up of an acoustic model and a vocoder.
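
The cascaded design can be sketched as two stages chained together. This is a toy illustration only: toy_acoustic_model and toy_vocoder are hypothetical stand-ins (in a real system both stages are trained neural networks), and the constants are arbitrary assumptions chosen to keep the example small.

```python
import math

N_BINS = 8          # toy number of spectrogram bins (assumption)
HOP = 80            # audio samples rendered per frame (assumption)
SAMPLE_RATE = 8000  # toy sample rate in Hz (assumption)


def toy_acoustic_model(text):
    """Stage 1 stand-in: map each character to one spectrogram-like frame."""
    frames = []
    for ch in text.lower():
        frame = [0.0] * N_BINS
        if ch.isalpha():
            frame[ord(ch) % N_BINS] = 1.0  # put all energy in a single bin
        frames.append(frame)               # non-letters stay silent
    return frames


def toy_vocoder(frames):
    """Stage 2 stand-in: render each frame as a sine tone at its strongest bin."""
    samples = []
    for frame in frames:
        peak = max(range(N_BINS), key=lambda b: frame[b])
        amplitude = frame[peak]
        freq = 200.0 + 100.0 * peak        # arbitrary bin-to-frequency mapping
        for _ in range(HOP):
            t = len(samples) / SAMPLE_RATE
            samples.append(amplitude * math.sin(2 * math.pi * freq * t))
    return samples


def cascaded_tts(text):
    """Text -> intermediate frames -> waveform samples."""
    return toy_vocoder(toy_acoustic_model(text))


waveform = cascaded_tts("hi there")
```

The point of the sketch is the hand-off: the vocoder never sees the text, only the intermediate frames, so any weakness in that intermediate representation carries through to the audio.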

Researchers from Korea and Microsoft Research Asia presented an end-to-end TTS model in 2021 that jointly optimizes the acoustic model and vocoder in order to address the vocoder’s drawbacks.

In practice, it is desirable to adapt a TTS system to any voice using only a few enrolled recordings.

As a result, zero-shot multi-speaker TTS solutions are becoming more popular, with the majority of research concentrating on cascaded TTS systems.

Testing by Google researchers in 2019 later showed that such a model can produce high-quality output for in-domain speakers using just three seconds of enrolled recordings.

Chinese researchers also improved the quality for unseen speakers in 2018 using sophisticated speaker embedding models, although there is still room for improvement.

VALL-E follows the cascaded TTS tradition but, in contrast to earlier research from Chinese academics at Zhejiang University, uses audio codec codes as the intermediate representation.

It is the first TTS system to show strong in-context learning capabilities like those of GPT-3, without requiring fine-tuning, pre-designed features, or a sophisticated speaker encoder.

How does it function?

The researchers provide audio samples of the AI model in use. In each one, the “Speaker Prompt” is the three-second audio cue that VALL-E must imitate. The “Baseline” sample represents traditional text-to-speech synthesis, and the “VALL-E” sample is the model’s output.

The evaluation results show that VALL-E outperforms the most advanced zero-shot TTS system on both LibriSpeech and VCTK, achieving state-of-the-art zero-shot TTS results on the two datasets.


The researchers claim that although VALL-E has made great progress, it still has the following issues:

  • The authors of the study point out that speech synthesis occasionally produces unclear, missing, or duplicated words. The primary cause is that the phoneme-to-acoustic language part is an autoregressive model, in which disordered attention alignments occur and nothing constrains them.
  • Even 60,000 hours of training data cannot cover every conceivable voice. This is especially true for speakers with accents. Because LibriLight is an audiobook dataset, most of the utterances are in a reading style. The variety of speaking styles therefore needs to be expanded.
  • The researchers currently use two models to predict the codes for the different quantizers. A promising next step is to predict them with a single large universal model.
  • Because VALL-E can synthesize speech while maintaining speaker identity, the model carries a risk of misuse, such as voice ID spoofing or impersonation.
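
The two-model setup mentioned in the list above can be sketched as follows. The sketch assumes a split that this article does not spell out: one autoregressive model predicts the first quantizer’s codes frame by frame, and a second, non-autoregressive model fills in each remaining quantizer in a single pass. The predictors toy_ar_step and toy_nar_layer are hypothetical stand-ins that only mimic the data flow, and the quantizer count and code vocabulary size are likewise assumptions.

```python
N_QUANTIZERS = 8  # assumed number of codec quantizers
VOCAB = 1024      # assumed number of entries per codebook


def toy_ar_step(phonemes, history):
    """Stand-in autoregressive model: the next code depends on every earlier code."""
    return (sum(phonemes) + 31 * sum(history) + len(history)) % VOCAB


def toy_nar_layer(phonemes, lower_layers, layer):
    """Stand-in non-autoregressive model: all frames of one layer in one pass."""
    return [(sum(phonemes) + layer + sum(column)) % VOCAB
            for column in zip(*lower_layers)]


def generate_codes(phonemes, prompt_first_layer, n_frames):
    """Return one list of codes per quantizer, prompt frames included."""
    # Stage 1: continue the speaker prompt one frame at a time.
    first = list(prompt_first_layer)
    for _ in range(n_frames):
        first.append(toy_ar_step(phonemes, first))

    # Stage 2: each remaining quantizer is predicted from the layers below it.
    layers = [first]
    for layer in range(1, N_QUANTIZERS):
        layers.append(toy_nar_layer(phonemes, layers, layer))
    return layers


# Made-up phoneme IDs and a made-up three-frame speaker prompt.
layers = generate_codes([5, 12, 7], [101, 202, 303], n_frames=4)
```

The sequential loop in stage 1 is where errors can compound from one frame to the next, while stage 2 runs a fixed number of passes regardless of utterance length. A single universal model, as the researchers suggest, would fold both stages into one predictor.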


In recent years, neural networks and end-to-end modeling have improved speech synthesis. Cascaded text-to-speech (TTS) systems now typically combine an acoustic model with a vocoder, using spectrograms as the intermediate representation.

Modern TTS systems can synthesize high-quality speech in the voice of a single speaker or of multiple speakers.

TTS technology has also been built into a variety of software and hardware, including e-learning systems and virtual assistants such as Amazon’s Alexa and Google Assistant.

It is also used in marketing, customer service, and advertising to enliven and personalize interactions.
