Text-to-Speech: What exactly will the future sound like?

Learn about zero-shot text-to-speech models, how to embed emotions into TTS, and the challenges of vocal data.

Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.

In this edition:

  • 📈 The top 7 text-to-speech startups of 2023

  • 🏛️ The history of text-to-speech

  • 📺 Video: How hard can building a TTS model really be?

  • 💿 A Vector-Quantized Approach to TTS

  • ✏️ A survey on how Diffusion Models affected TTS

  • 🗣️ Zero-shot TTS with Emotions

  • 📲 New AI Apps: Willowrid, Jumpspeak, Snapvid

  • 🤖 Bonus content: User confidence in OpenAI & Pressure tests

Thanks for letting us crash your inbox; let’s party. 🎉

We coded with the brand-new Whisper-v3 over the past week, and the results were not what we expected. Check it out here!

🐎 The text-to-speech boom is here and is still being heard

Leaving You Speechless: The Top 7 AI Text-to-speech and Voice-cloning Startups of 2023 - In this article, we showcase the best TTS startups of the past year. Whether you’re building a GPS system, designing a video game, or forging a verbally conversational AI, these startups will have exactly what you need.

AI Voice Synthesis: The Early Days and Implications of Text-to-Speech - With great power comes great responsibility. What are some of the ethical questions and challenges facing Synthetic Audio today? How can we ensure that nobody uses this technology maliciously? Learn more here.

🎥 Video: How hard can building a Text-to-Speech model really be?

We can’t just write about text-to-speech without showing off some TTS in action! Below is a video that shows exactly what it takes to build a brute-force TTS model, so we can truly appreciate how much more work goes into creating models of actual quality.
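For a flavor of what “brute force” means here, consider a minimal sketch: look up a pre-recorded clip for every word and glue the samples together. Everything below, including the tiny clip bank, is an illustrative toy, not the video’s actual code.

```python
# A deliberately naive, brute-force TTS sketch: look up a pre-recorded
# clip for each word and concatenate the samples. Real systems must
# handle prosody, coarticulation, and out-of-vocabulary words, which is
# exactly where the hard work begins.

# Hypothetical "recordings": each word maps to a list of audio samples.
CLIP_BANK = {
    "hello": [0.1, 0.2, 0.1],
    "world": [0.3, 0.1, -0.2],
}
SILENCE = [0.0, 0.0]  # a short pause inserted between words


def brute_force_tts(text: str) -> list[float]:
    """Concatenate per-word clips; silently skip words with no recording."""
    samples: list[float] = []
    for word in text.lower().split():
        clip = CLIP_BANK.get(word)
        if clip is None:
            continue  # a real system needs a fallback, e.g. grapheme-to-phoneme
        if samples:
            samples.extend(SILENCE)
        samples.extend(clip)
    return samples


print(brute_force_tts("Hello world"))
```

The failure modes are immediate: no coverage for unseen words, robotic joins between clips, and zero control over intonation. Neural models exist precisely to fix these.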

🧑‍🔬 How Researchers Actually Accomplish TTS

Want to learn what the latest developments are in text-to-speech technology? Check out these papers below!

We observe the mismatch between training and inference alignments in mel-spectrogram-based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system, whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality.

Chen, et al. (Carnegie Mellon University, 2023)
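The core idea in the MQTTS abstract is that each frame gets represented by discrete codes drawn from several codebooks, rather than as a raw mel-spectrogram. Here is a minimal sketch of that multi-group quantization step; the codebook sizes and contents are illustrative toys, not values from the paper.

```python
# Multi-group vector quantization: encode one continuous frame as a
# tuple of discrete indices, one per code group (codebook).
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (L2 distance)."""
    def dist(entry):
        return math.sqrt(sum((v - e) ** 2 for v, e in zip(vector, entry)))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def quantize_frame(vector, codebooks):
    """Encode one frame as a tuple of discrete indices, one per code group."""
    return tuple(nearest_code(vector, cb) for cb in codebooks)


# Two toy code groups, each holding three 2-D entries.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-1.0, 0.0], [0.0, -1.0]],
]
print(quantize_frame([0.9, 0.1], codebooks))  # → (1, 0)
```

Because the decoder then only ever sees codes that exist in the codebooks, the train/inference mismatch the authors describe is much harder to fall into than with free-form continuous spectrograms.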

This work conducts a survey on audio diffusion models. . . . Specifically, this work first briefly introduces the background of audio and diffusion models. [Then], we divide the methods into three categories based on the stage where the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework.

Zhang, et al. (April 2023)
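All three categories the survey covers build on the same forward diffusion process: progressively mixing clean audio features with Gaussian noise, which a learned model then reverses. A tiny sketch of that closed-form forward step (the schedule value and toy "frame" are illustrative, not from any surveyed paper):

```python
# Closed-form forward diffusion: x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps,
# where ab is the cumulative noise-schedule product and eps is Gaussian noise.
import math
import random


def diffuse(x0, alpha_bar, rng):
    """Noise a feature vector to the level implied by alpha_bar in [0, 1]."""
    return [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for x in x0
    ]


rng = random.Random(0)
clean = [1.0, -0.5, 0.25]  # stand-in for a mel-spectrogram frame
noisy = diffuse(clean, alpha_bar=0.5, rng=rng)
# As alpha_bar approaches 0 the signal becomes almost pure noise; the
# learned model (acoustic model, vocoder, or end-to-end) runs this in reverse.
```

Whether the model denoises mel-spectrograms (acoustic model), waveforms (vocoder), or goes text-to-waveform directly (end-to-end) is exactly the axis the survey uses to organize the field.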

Emotional Text-To-Speech (TTS) is an important task in the development of systems. . . . In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.

Kang, et al. (May 2023)
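The interface the ZET-Speech abstract describes is appealingly simple: a short neutral reference clip pins down the speaker, and a separate emotion label steers the style. A high-level sketch of that conditioning split is below; every function body here is a hypothetical placeholder, not the paper’s actual model.

```python
# Zero-shot, label-guided emotional TTS conditioning: speaker identity
# comes from a neutral reference clip, emotion from a discrete label.

EMOTIONS = {"neutral": 0, "happy": 1, "sad": 2, "angry": 3}


def speaker_embedding(reference_samples):
    """Stand-in for a speaker encoder: here, just crude summary statistics."""
    mean = sum(reference_samples) / len(reference_samples)
    return (mean, max(reference_samples) - min(reference_samples))


def zero_shot_emotional_tts(text, reference_samples, emotion):
    """Return the conditioning bundle a decoder would turn into speech."""
    return {
        "text": text,
        "speaker": speaker_embedding(reference_samples),
        "emotion_id": EMOTIONS[emotion],  # label-guided, per the abstract
    }


cond = zero_shot_emotional_tts("good morning", [0.1, 0.3, -0.1], "happy")
print(cond["emotion_id"])  # → 1
```

The "zero-shot" part is that the speaker encoder never saw this voice during training, and the reference clip never needs to contain the target emotion.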

🐝 Social media buzz

Here’s what the online community is saying about text-to-speech! From Meta to Twitch, see TTS in action in the real world here:

📲 Apps to Check out

Willowrid: the AI-powered platform revolutionizing blog post generation from video content in just three clicks:

✔️ Convert video into text for blog posts enriched with visual content, ensuring seamless alignment.

✔️ Multi-language support and AI-driven video summarization simplify content creation.

✔️ Create PDF and DOCX files from articles with ease.

Automatic inclusion of relevant images enriches articles, elevating engagement beyond traditional transcriptions. Its unique AI-based summarizing feature provides quick, comprehensive video content summaries.

AI Apps Feature here

Jumpspeak: master real-life dialogue through modern topics and phrases. Experience lifelike accents and voices for enhanced listening skills. Receive personalized assessments covering grammar, vocabulary, and conversation nuances.


  • Engaging conversations guided by an AI Tutor

  • Real-time speech recognition for pronunciation feedback

  • Tailored insights into language intricacies

Languages supported: Spanish, French, German, English, Italian, Portuguese, Korean, Japanese, Dutch, Polish, Swedish, Danish, Vietnamese, Hungarian, Norwegian, Turkish, and Russian.

AI Apps Feature here

Snapvid helps transform your video editing process effortlessly. Add subtitles, emojis, video footage, transitions, and sound effects in seconds using intuitive AI-powered tools.

Features include custom animated subtitles and emojis, along with one-click insertion of video elements. Snapvid also offers multi-export capabilities, catering to various editing needs.

Perfect for video editors, content creators, social media agencies, and YouTubers, Snapvid revolutionizes editing workflows, saving time and enhancing creativity. Experience a seamless editing experience with Snapvid!

AI Apps Feature here

🤖 Additional bits and bytes

Generalist Large Models vs Smaller Fine-tuned Models - How far can prompt engineering go when asking LLMs about a very specific domain like Medicine? Can thorough prompt engineering hold a candle to fine-tuning? Find out in this case study!

Granite Foundation Models - It’s rare to see a paper about training an FM be this transparent. Kudos to IBM.

User Confidence in OpenAI - Following all the OpenAI news of the past week, this survey on user confidence in OpenAI (versus alternative models and providers) reveals quite an interesting result.

⭐In case you missed it - Pressure testing OpenAI and Anthropic’s long context windows.

🖥️ An upcoming webinar! The Magic of Multimodal Data

Come join Vonage, Deepgram, Voicify, and Flowcode as they present “The Art of the Possible: Creating a Modern Multimodal Customer Journey,” a webinar which showcases ways to create a cohesive and personalized narrative for your customers using creative, multimodal means to drive engagement.

The webinar takes place on December 12, 2023 at 12pm ET. It will be one hour long, and access will be granted upon registration at the links above! 🚀