AI Minds #3: MMM, Whatcha Say?

Not Jason Derulo, but Voice AI... and some Siri-ously cool startups in the space.

Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.

In each edition, we’ll highlight a special topic from the rapidly-evolving field of Language AI. We’ll also touch on timely issues the AI community is buzzing about. As a heads-up, you might want to brace yourself for the occasional spicy take or two along the way.

Thanks for letting us crash your inbox; let’s party. 🎉

⚡️ Introducing Nova-2: The Fastest, Most Accurate Speech-to-Text API

Deepgram is excited to announce the launch of our latest speech-to-text (STT) model, Nova-2. It's bigger, better, and even more powerful than its predecessor, Nova-1, which was announced just earlier this year.

It’s for this reason that this edition of AI Minds is dedicated to the topic of speech-to-text. But before we dive into the details, let’s take a quick look at what makes Nova-2 the most kick-ass speech model API on the block.

🦾 We Have the Technology: Nova-2’s Enhanced Capabilities

Trained on over 1 million hours of audio data, Nova-2 uses a Transformer-based automatic speech recognition (ASR) model architecture.

What does that deliver? The short answer: Accuracy. What kind of accuracy? As compared to Nova-1, Nova-2 offers:

  • 18% reduction in WER

  • 22.6% more accurate punctuation

  • 31.4% improvement in capitalization error rate
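For anyone who hasn't met WER (word error rate) before, here is a minimal Python sketch of how it's commonly computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function names and the sample numbers are our own illustrations, not Deepgram's benchmarking code, and the sketch assumes the "18% reduction" is a relative one, which is the usual convention.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on old_wer."""
    return (old_wer - new_wer) / old_wer


if __name__ == "__main__":
    # One wrong word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown box"))
    # Hypothetical WERs chosen only to illustrate an 18% relative reduction.
    print(relative_reduction(0.10, 0.082))
```

Under that reading, a model that went from a (made-up) 10% WER to 8.2% WER would count as an 18% reduction.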

OK, but what about the competition? On accuracy, Nova-2 outperforms its closest commercial competitor by 16.8% and beats OpenAI’s open-source ASR model, Whisper-large, on WER by a whopping 36.4%.

Those are just overall numbers though. Nova-2’s inference accuracy is especially good (at least vis-a-vis the competition) for phone call and media-parsing tasks.

Surely, you must be thinking, access to such a model will cost bazillions of dollars. Nope: Nova-2 pricing starts at just $0.0043 per minute.

To review: Deepgram’s Nova-2 model demonstrates state-of-the-art performance on inference accuracy, inference speed, and cost effectiveness across domain-specific STT tasks. But here’s the TLDR version of the rest:

🎯 Nova-2 delivers 18% higher accuracy than its predecessor.

🗣️ Nova-2 beats the competition on real-time transcription accuracy by 30%.

🏎️ Median inference time for 1 hour of audio is just 29.8 seconds.

💸 API pricing for Nova-2 kicks off at just $0.0043/minute.

Long story short: Nova-2 is pretty nifty, and there’s a lot more to learn in the blog post linked above.
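To put the speed and price figures in more familiar terms, here's a quick back-of-the-envelope sketch in Python. It uses only the two numbers quoted above; the constant and function names are ours, and it assumes the starting price applies to every minute, which may not hold for every plan or feature.

```python
# Figures quoted in this newsletter; everything else is illustrative.
PRICE_PER_MINUTE_USD = 0.0043          # starting price, assumed flat
MEDIAN_SECONDS_PER_AUDIO_HOUR = 29.8   # median time to transcribe 1 hour


def transcription_cost_usd(audio_minutes: float) -> float:
    """Cost of transcribing audio_minutes at the starting per-minute price."""
    return audio_minutes * PRICE_PER_MINUTE_USD


def speedup_over_real_time(
    audio_seconds: float = 3600.0,
    processing_seconds: float = MEDIAN_SECONDS_PER_AUDIO_HOUR,
) -> float:
    """How many times faster than playback speed the transcription runs."""
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    print(f"1 hour of audio:     ${transcription_cost_usd(60):.3f}")
    print(f"1,000 hours of audio: ${transcription_cost_usd(60 * 1000):.2f}")
    print(f"Speed-up over real time: {speedup_over_real_time():.1f}x")
```

In other words, roughly a quarter of a dollar per hour of audio, processed about 120 times faster than it would take to listen to it.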

🗣️ The Expanse of Voice

ASR has come a long way: from on-paper human transcription, to rule-based mechanical operations, to where we’re at today, with end-to-end deep learning models that automate the most boring of workflows.

But a tool is only as good as its specific use case. And at least when it comes to Language AI, rock-solid ASR is the foundational platform for an expanding universe of AI apps. Here, we highlight a series of use cases and showcase a company (or two) working to realize the potential of language AI in the enterprise and beyond.

🏗️ Building Infra for AI Builders

Any fan of flowcharts knows that one idea oftentimes logically connects to another.

But what's the middle point between intuition and code? For skilled programmers seeking out a prototyping tool, up-and-coming folks who prefer connecting the dots to writing code, and everyone else in between, there's AirOps.

AirOps is a playground for AI newcomers on one side and a full-featured developer platform on the other. Developers building with AirOps can visually define ongoing workflows and preview results as they're delivered. They can specify, implement, and launch AI apps built with basic building blocks like search, inference, and projections through an interface that's as easy as drag-and-drop and filling in the blanks.

Backed by Altman FO Apollo Projects, alongside Founder Collective, and lead investor Wing Venture Capital, AirOps has raised at least $7 million in total funding so far.

😲 Computing Human Emotions

Ask someone how they're doing and the way they answer will tell you a lot. There's a world of difference between a bright and chipper "Oh, great!", a sardonic "Oh, greeeaaaat" accompanied by an eye roll, and a flat-sounding "Ohhh great." delivered while staring blankly into the middle distance. On paper, they’re the same words. In practice, the meaning is in how the message is delivered.

For most humans, picking up on these cues is instinctual. For computers, suffice it to say that emotional valence just doesn't compute. Hume AI is aiming to change that. The NYC-based startup is building a toolkit for developers to measure and understand human emotion. A single API gives developers access to several different AI models, from a voice model that can classify over 50 different emotions from speech data to a multimodal model that incorporates voice and facial expression data to predict anger or excitation. The company has also been fairly transparent with its research, and it has compiled and published several open datasets, including a 400,000+ sample dataset of facial expressions and a 300,000+ sample set of vocal utterances expressing everything from distress and disgust to desire.

Hume AI was founded in March 2021 and has raised $17.7 million in funding to date from Union Square Ventures, Aegis Ventures, and Comcast Ventures, among others.

💹 Transforming Earnings Calls into Investment Intelligence

"Building in public" has become a bit of a meme in startup circles, but for public companies, it's not just an ethos; it's the law. On a quarterly basis, company leadership and their cadre of beancounters spill the beans about, among other things, how much money the company made and what their plans are. And it turns out there's a lot more to be learned from an earnings call than just numbers on a spreadsheet.

It's that vein of voice data that Quartr is mining. Straight outta Stockholm, the startup transcribes earnings calls en masse and compiles those transcripts, along with slide decks and earnings reports, into a searchable database. Ingesting reports from "roughly 8,000" public companies yields a lot of language data with which investment researchers can answer questions: Are retail executives using the word "inflation" more frequently now vs. last year? How frequently do semiconductor companies mention "geopolitical risk" and "Taiwan"? Are snack makers afraid of "GLP-1 agonists"? When did everyone start talking about "Large Language Models" anyway?
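For the curious, here's a toy Python sketch of the kind of question a transcript database makes answerable: how often a term shows up per year. To be clear, this is not Quartr's API or method. The function name and the two sample "transcripts" are made up for illustration, and a real system would need far more careful text handling.

```python
import re
from collections import Counter


def count_term_by_year(transcripts, term):
    """Count case-insensitive whole-phrase matches of term per year.

    transcripts is an iterable of (year, transcript_text) pairs.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for year, text in transcripts:
        counts[year] += len(pattern.findall(text))
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical snippets standing in for real earnings call transcripts.
    sample_transcripts = [
        (2022, "Inflation pressured margins. We expect inflation to ease."),
        (2023, "Demand was stable despite inflation."),
    ]
    print(count_term_by_year(sample_transcripts, "inflation"))
```

Swap in "geopolitical risk" or "GLP-1 agonists" as the term and the same counting logic applies, which is the appeal of having thousands of calls in one searchable place.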

Originally launched as a mobile app, Quartr is now also available on desktop, and the data service provider has launched an API to enable other financial application developers to integrate Quartr's data and live earnings call feeds into their apps. The startup has raised $7.1 million in seed capital since its inception in 2020; its last reported funding round was in July 2022.

☎️ Removing Bias and Making Call Centers Better for Employees

Founded by Stanford engineers in honor of a friend, another Stanford engineer, Sanas is an accent translation technology that aims to remove the bias barrier for non-native speakers of the local language on phone calls and other technical modes of communication. Though there could be several applications for Sanas, the current target market is call centers and customer service departments. The company states that by removing the accent barrier, Sanas empowers employees to do their jobs with the psychological security that they will not be berated, harassed, or otherwise mistreated because of their accent.

Sanas currently supports accent neutralization to and from English-American, Australian, British, Filipino, and Spanish. Case studies showed that using the product improved customer understanding by 31% and improved customer satisfaction by 21% while protecting the mental health of the employees who choose to use it.

In June of 2023, Sanas announced a $32,000,000 Series A raise, the largest Series A round of any speech technology company to date, with participation from top venture capital firms like Insight Partners, General Catalyst, GV, and more.

🚑 In Emergency Response, Efficiency is Everything

Prepared, a 2020-vintage company, offers an end-to-end AI solution developed for emergency 911 callers and emergency response teams. Emergency dispatch centers armed with Prepared Live can send callers a link that lets them live stream video, communicate over text, and provide a precise GPS location. The enhanced version of Prepared Live, the paid subscription, offers language translation, external integrations, and more.

For first responders, the company developed Prepared Onscene, a product that integrates with Prepared Live so that the data provided by the caller can be transmitted to the responding units before they arrive. This lets firefighters, police, and emergency medical services see fires, situations, or victims before they are physically present, giving them time to mentally prepare for the work ahead. Unlike Prepared Live, Onscene has no freemium model and is only available through a subscription.

Lastly, Prepared AI is the company’s transcription, translation, and workflow automation product intended to assist 911 dispatchers in doing their jobs more efficiently in the wake of the labor shortage. To date, Prepared has raised over $11 million across two Seed rounds, with First Round Capital leading.

📝 Streamlining Studying for Success

Whiteboard is an AI-powered educational study tool geared towards college students. The platform has various features that are intended to help students study more efficiently by removing lecture fluff to focus on the materials that matter. The Whiteboard platform allows students to import their recorded lectures; it then uses an LLM to notate the lecture in clear and concise sections alongside the recording.

In addition to its “Video Chat” feature, Whiteboard offers enhanced notes on reading materials, flashcard development, and AI-powered tutoring. The AI tutoring collects applicable user materials and will point said users to the appropriate content that answers their inputted questions. The platform has several subscription options, including a free plan with some content limitations. Individual students can subscribe to Whiteboard; alternatively, it can be provided to students by their universities.

Honorable Mentions

The applications of voice technology are as diverse as the folks, worldwide, who speak every day. And to avoid turning this edition of AI Minds into a tome that only David Foster Wallace or William T. Vollmann could be proud of, we want to highlight some other interesting use cases in a couple of lines or less:

  • CoNote - An all-in-one tool to transcribe, analyze, and organize your data while you focus on what’s essential—collaborating with your team.

  • UpHeal - The transcription platform that keeps your eyes on your patients and frees up your time after sessions.

  • Transcript.lol - Your lickety-split source for transcripts of whatever you just worked on, from podcasts to team meetings and more.

  • Stimuler - Your on-the-go English teacher that will always meet you where you’re at, no matter where you’re at.

BTW, if you’re a startup building voice into your product, check out Deepgram’s startup program below. 👇

🌟 Get up to $100,000 in speech-to-text credits.
Join the Deepgram Startup Program 🚀