AI Minds Newsletter
Jensen Huang and NVIDIA’s New Dataset, Apple’s new Siri, The Ultimate Guide to AI Voice Cloning

NVIDIA's new Granary dataset features around 1 million hours of audio. Apple's new Siri will be able to take actions on your behalf across various apps by following voice commands. And you can learn how to go easily from a beginner to a pro voice cloner with a simple, eleven-minute tutorial! All these features and more in today's edition of AI Minds.

Jose Nicholas Francisco
August 20, 2025

Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.

In this edition:

🎥 Ultimate AI Voice Cloning Guide - BEGINNER To Pro 2025
🔈 Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction
🧠 Analyzing Conversational Context Recall and Utilization in Voice Interaction Models
💻 Deepgram Saga: The Voice OS for Developers
🎯 What Developers Need to Know About Speech Recognition Metrics
⚡ Deepgram Partners with AWS to Accelerate Voice AI Deployment
🐝 Social Media Buzz: Zero-Shot Voice Cloning, the smallest voice agent, and more!
💻 New Webinar: Building AI Voice Agents with Deepgram + AWS Bedrock
🎙️ The AI Minds Podcast with Thibault Mardinli, Explorer at Voice AI Space
📈 Gupshup raises $60M+ to expand its conversational AI and messaging platform
💰 Best AI Meeting Notes Assistants for Fintech Teams
🔊 In Case You Missed It: GPT-5 and the Future of Voice AI
🤖 Voice Agent API Just Leveled Up: GPT-5 + GPT-OSS-20B
🍎 Apple’s new Siri may allow users to operate apps just using voice
💬 NVIDIA releases open dataset, models for multilingual speech AI

Thanks for letting us crash your inbox; let’s party. 🎉

Want a single, unified conversational AI API for building real-time, enterprise-ready, and cost-effective voice AI agents? Check out this link!

🎥 Ultimate AI Voice Cloning Guide - BEGINNER To Pro 2025

Learn how to go easily from a beginner to a pro in this eleven-minute tutorial! The three tools featured are labeled as “Beginner AI Voice Cloning,” “Mid-Level AI Voice Cloning,” and “Professional AI Voice Cloning.” Which one do you think is best?

🔍 Voice-Language Foundation Models for Voice Role-Play and the Surprising Memory of Voice Assistants

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play - “The authors of this paper present Voila, a family of large voice-language foundation models that moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion.”

Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models - “Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored.” This paper fills this gap.

🧠 Deepgram Saga: The Voice OS for Developers

Deepgram Saga: Meeting You Where and How You Work - Execute complex workflows across your entire tech stack using voice or text—whatever feels natural in the moment. With one-click integrations to Gmail, Google Calendar, Slack, Notion, Linear, Perplexity, and more, Saga connects seamlessly to the tools you already use.

Deepgram Saga isn't just another voice assistant—it's a voice OS built specifically for developers. What you can do:

📅 Manage calendars and schedule meetings without leaving your IDE
📥 Send emails, create tickets, and update team channels
🔍 Research solutions with Perplexity and save findings to Notion
⏩ The result? Faster execution, fewer context switches, and development that truly happens at the speed of thought.

Ready to make your voice your most powerful development tool?

Try Deepgram Saga →

⚡ What Developers Need to Know About Speech Recognition Metrics and Deepgram Partners with AWS to Accelerate Voice AI Deployment

What Developers Need to Know About WER, KER, and KRR - This article introduces three of the most widely used speech recognition metrics: Word Error Rate (WER), Keyword Error Rate (KER), and Keyword Recognition Rate (KRR).
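
For a feel of how these metrics behave, here is a minimal Python sketch. The WER function uses the standard definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The keyword function is only one plausible, simplified reading of KRR (the share of reference keyword occurrences that also appear in the transcript, with KER taken as its complement); the linked article may define KER and KRR differently, and the function names and sample sentences here are hypothetical.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def keyword_recognition_rate(reference: str, hypothesis: str, keywords: list[str]) -> float:
    """ASSUMPTION: simplified KRR = recognized keyword occurrences / reference keyword occurrences."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts[k.lower()] for k in keywords)
    if total == 0:
        raise ValueError("no keyword occurs in the reference")
    recognized = sum(min(ref_counts[k.lower()], hyp_counts[k.lower()]) for k in keywords)
    return recognized / total


# Hypothetical sample data, not taken from the article.
reference = "call deepgram about the aws invoice"
hypothesis = "call deep gram about the aws invoice"
keywords = ["deepgram", "aws"]

wer = word_error_rate(reference, hypothesis)
krr = keyword_recognition_rate(reference, hypothesis, keywords)
ker = 1.0 - krr  # ASSUMPTION: KER treated as the complement of KRR
print(f"WER={wer:.3f} KRR={krr:.2f} KER={ker:.2f}")
```

In this example a single split word ("deep gram") costs two word-level edits out of six reference words, yet it wipes out half of the keywords. That gap is the reason to track keyword-focused metrics alongside WER.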

Deepgram Signs Strategic Collaboration Agreement with AWS to Accelerate Global Deployment of Voice AI - As a Generative AI Competency Partner and long-standing AWS Partner Network (APN) member, Deepgram offers a full-featured voice AI platform that includes speech-to-text (STT), text-to-speech (TTS), and speech-to-speech (STS) capabilities.

🐝 Social Media Buzz: Zero-Shot Voice Cloning, the smallest voice agent, and more!

🚀 Just released on HF: Spark-TTS, an LLM-powered text-to-speech model that does zero-shot voice cloning & fine-grained voice creation — all in a single stream!
> Built on Qwen2.5
> Control pitch, speed, & speaker style directly from text.
— steven (@Tu7uruu)
10:13 AM • Mar 5, 2025

Is this the tiniest little voice agent yet?!
My @elevenlabsio voice clone running on an esp32 microcontroller via @pipecat_ai and WebRTC! 🔥
Story time: I recently caught up with Danilo Campos who is building the awesome DeskHog (seriously, check it out!) at @posthog and he
— Thor 雷神 ⚡️ (@thorwebdev)
4:30 PM • Jul 15, 2025

Are you amazed or horrified?
This AI voiceover is FLAWLESS.
Leo presenting as Joe Rogan, Steve Jobs, Robert Downey Jr, Bill Gates & Kim Kardashian.
— Lorenzo Green 〰️ (@mrgreen)
7:20 AM • Feb 16, 2023

🎙️ The AI Minds Podcast with Thibault Mardinli, Explorer at Voice AI Space

Voice AI Space is a beacon for navigating voice tech's wild seas. It guides developers, entrepreneurs, and enthusiasts to top tools, news, knowledge, and careers across the vast voice AI ocean.

💻 New Webinar: Building AI Voice Agents with Deepgram + AWS Bedrock

Sign up here: Deepgram’s Voice Agent API brings lightning-fast speech-to-text and lifelike text-to-speech together with event hooks and speaker diarization, all in real time. Amazon Bedrock gives you instant access to leading foundation models like Claude and Titan, with built-in safety, compliance, and flexibility, perfect for powering voice agents with real intelligence.

Join us to learn how to build scalable, responsive AI voice agents that actually work in production.

⏰ When: Tuesday, September 9, 10:30am - 11:30am PDT

🧭 Where: Online

✅ Sign up here!

🤖 Bonus Bits and Bytes!

🧠 Gupshup raises $60M+ to expand its conversational AI and messaging platform - Conversational AI continues to expand its presence, as we predicted in our State of Voice Report this year.
💰 Best AI Meeting Notes Assistants for Fintech Teams - Not all meeting notes assistants are created equal. Find out which ones are best in this article.
🔊 In Case You Missed It: GPT-5 and the Future of Voice AI - Explore how GPT-5’s new capabilities redefine what is possible for real-time voice agents and the infrastructure needed to power them.
🤖 Voice Agent API Just Leveled Up: GPT-5 + GPT-OSS-20B - GPT-5 and GPT-OSS-20B are now live in the Deepgram Voice Agent API and Playground, giving developers more choice for reasoning depth, latency, cost efficiency, and open-source flexibility.
🍎 Apple’s new Siri may allow users to operate apps just using voice - Apple is testing a version of Siri that will be able to take actions on your behalf across various apps by following voice commands.
💬 NVIDIA releases open dataset, models for multilingual speech AI - The new Granary dataset, featuring around 1 million hours of audio, was used to train high-accuracy and high-throughput AI models for audio transcription and translation.