Amazon researchers have introduced the largest text-to-speech model to date, which shows an improved ability to articulate complex sentences.
The model, BASE TTS, whose name stands for Big Adaptive Streamable TTS with Emergent abilities, could lay the foundation for more human-like interactions.
The research suggests that training TTS models at scale could improve their reliability and versatility, much as it has for large language models (LLMs).
Amazon's BASE TTS impresses researchers
The text-to-speech model was trained on 100,000 hours of public domain speech data, giving the tool “state-of-the-art naturalness.” The data is predominantly English, with some German, Dutch and Spanish.
Additionally, the researchers found that training a TTS model on as little as 10,000 hours of speech can already improve its ability to articulate complex sentences naturally.
With 980 million parameters, BASE-large is described as the largest text-to-speech model created to date. To compare results, the team also trained smaller models with 400 million and 150 million parameters, on 10,000 and 1,000 hours of speech respectively.
The Amazon team describes BASE TTS as a “high-fidelity model capable of mimicking speaker characteristics with just a few seconds of reference audio,” acknowledging the need for more research while recognizing its potential.
Key areas the researchers focused on include compound nouns, emotions, foreign words, paralinguistics, punctuation, questions, and syntactic complexity. Examples can be found on a dedicated web page.
With artificial intelligence dominating much of 2023, text-to-speech advances like this one could continue to put once-futuristic technologies into the hands of the masses in 2024, but the research team's cautious approach highlights the need for adequate regulation amid security and privacy fears.