AI Minds Newsletter
Posts
Voices of AI: Generative Music Models, Voice Cloning, Voice Transfer & more!

Voices of AI: Generative Music Models, Voice Cloning, Voice Transfer & more!

From speeches to songs, language models are evolving in ways we couldn't have imagined a year ago. From Suno to AutoVC, here's what the voices of AI sound like.

Jose Nicholas Francisco, Marcel Santilli & Demetrios
February 08, 2024

Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.

In this edition:

🎤 Top 11 Text-to-Speech models of 2024
📊 History and trajectory: An in-depth look at text-to-speech AI
📰 PNAS: The making of an AI news anchor with CNN
🤠 Syracuse University: The good, the bad, the ugly, and the gray of Generative AI
📲 4 new trending AI apps to check out! From remixing songs to cybersecurity.
🎷 Text-to-music Demo: What completely AI-generated songs sounds like (Suno AI)
🐦 Twitter on voice cloning and voice transfer
📖 AI Glossary pages on TTS & STT!
🌿 In case you missed it: Ecoacoustics and analyzing forests with ML

Thanks for letting us crash your inbox; let’s party. 🎉

Oh yeah, and while you may be familiar with Deepgram’s speech-to-text API, you might want to check out our upcoming text-to-speech technology as well 🥳

🏇 AI Voice and Ears: Journeying into auditory models

The Top 11 Text-to-Speech models of 2024: Whether you’re a developer working on a personal project or an enterprise in need of an innovative new TTS model, there’s bound to be a text-to-speech provider on this list for you! All eleven models are stars, but this list outlines the strengths of each one to find the best fit for your use-case. Check it out!

Challenging LLMs: An in-depth look at text-to-speech AI: Deep learning researcher Andy Wang goes over the history and status-quo of TTS. From delving into prosody issues to analyzing the future trajectory of vocal AI, Andy thoroughly navigates the depths of this complex world. He discusses TTS both from the technical perspective and the consumer perspective, so that we all gain a fuller understanding (and perhaps warnings) of the capabilities of this tech.

🧑‍🔬 Researching the gray-areas of voice AI: News Networks & Other (Shocking) Domains…

(PNAS) The making of an AI news anchor—and its implications - “[W]e attempted to create the opening of an AI-themed episode of The Whole Story, hosted by Anderson Cooper.…Using primarily open-source software, a first-year undergraduate student was able to create an AI news anchor compelling enough to air on network television.”

(Syracuse University) The Good, the Bad, the Ugly… and the Gray - “[Generative AI] can be used for both legitimate, benign purposes, or for malicious, harmful, and in some cases illegal purposes. These applications can be broken down into, the good, the bad, and the ugly.” Here are a few examples of each.

Bloks transform words into images with AI. Wepik is an online graphic design platform that allows users to easily create professional designs and marketing materials for small businesses.

CyberRiskAI provides businesses with comprehensive checklists, templates, and tools based on industry frameworks like NIST and ISO. Their automated platform makes it easy to complete risk assessments and get an overview of security gaps.

Wepik transform words into images with AI. Wepik is an online graphic design platform that allows users to easily create professional designs and marketing materials for small businesses.

Fadr - Remix songs with AI-separated stems - it's magic for music creators. Fadr is an AI-powered music creation and remixing platform that allows users to easily create stems, remixes, mashups, and more.

🎥 Suno AI Demo: What completely AI-generated music sounds like

This video showcases Suno AI in action. It’s a text-to-music model that follows essentially the same format as Midjourney and Stable Diffusion. You enter a prompt with the song you want, describing the genre, mood, and subject matter. The AI then outputs a full song with computer-generated voices, beats, instrumentals, harmonies, and rhythms. Check it out!

These tweet threads reveal voice-cloning and voice-transfer in action. Check them out!

Let's go! MetaVoice 1B 🔉
> 1.2B parameter model.
> Trained on 100K hours of data.
> Supports zero-shot voice cloning.
> Short & long-form synthesis.
> Emotional speech.
> Best part: Apache 2.0 licensed. 🔥
Powered by a simple yet robust architecture:
> Encodec (Multi-Band… twitter.com/i/web/status/1…
— Vaibhav (VB) Srivastav (@reach_vb)
Feb 6, 2024

Below are the three pipelines he experimented with.
The first utilizes an STT-to-TTS pipeline, while the other two pipelines utilize AutoVC.
The first of these AutoVC pipelines uses the model out-of-the-box; the other pipeline is finetuned to the parallel data.
— Deepgram (@DeepgramAI)
Feb 3, 2024

📚 Glossary Pages

If you like AI Minds and want to expand your knowledge even further, check out the Deepgram AI Glossary! We update the pages every week to give you the most up-to-date and relevant information on a wide variety of terms. Here are a few:

On Text-to-Speech: This glossary entry covers everything you’d want to know about text-to-speech technology—from specific use cases to voice cloning. It even delves into the question of why you need AI to perform text-to-speech, as opposed to simply using classical computer programs. The image above displays an excerpt from this page.

On Speech-to-Text: This glossary entry, like the one above, showcases the history, use cases, and status quo of Speech-to-Text technology. What are the best STT models right now? What’s the future trajectory of STT technology? Find out here!

🤖 Other Bits and Bytes

🤯From Last Week: Listening to Forests with Machine Learning & Algorithms - This article is essentially a case-study in ecoacoustics. Brad Nikkel, also from Stanford, takes an in-depth look at machine learning’s role in analyzing elephant vocalizations in a deep, dense forest. And the audio that the scientists picked up isn’t exactly what you’d expect…