Google’s Gemini, AI Robots & Biotech: Multimodal data in action

From building voice-commanded robots to designing dynamic virtual assistants, multimodal data is unavoidable. Learn everything you need to know here!

Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.

In this edition:

  • 🎨 How to design a multimodal conversational AI agent

  • 🎙️ Lightning Talk: Audio Data, LLMs, and breaching an untapped gold mine

  • 🐉 Using multimodal data to visualize speech (feat. Dungeons & Dragons!)

  • 💻 A breakdown of Google’s Gemini

  • 🏥 Multimodality applications in Biomedical Technology

  • 🗺️ A map of the status quo of multimodal AI

  • 🤖 Robots following voice commands!

  • 🐦 Social Media’s response to Google’s Gemini

  • 🐝 Karpathy shows off some new resources

  • 📲 Apps to check out: Graphlit, Heyday

  • ❣️ Bonus: Multimodality & Heart Disease, and is Mamba the new Transformer?

Thanks for letting us crash your inbox; let’s party. 🎉

We coded with the brand-new Whisper-v3 over the past week, and the results were not what we expected. Check it out here!

🐎 AI that codes, writes, films, and creates for you

Design Principles For Conversational AI: A Primer - With the rise of multimodal data comes the development of increasingly dynamic conversational AI. Whether it’s an Alexa that feels human, or a ChatGPT that seems like your personal librarian, we’re on the verge of something revolutionary. Here are some key design principles for building such an agent.

Check out this Lightning Talk from MLOps World 2023! In it, former Stanford researcher and current ML Developer Advocate Jose Francisco introduces a multimodal approach to training LLMs, so that they gain the ability not only to read+write, but also to speak+listen.

In this video, we combine a speech-to-text model with a text-to-image model in order to create one large speech-to-image engine. We then use it to visualize a game of Dungeons and Dragons! 
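If you'd like to tinker with the same idea yourself, here's a minimal sketch of a speech-to-image pipeline built from open-source pieces: Whisper (large-v3) for transcription and Stable Diffusion (via the diffusers library) for image generation. The audio filename and prompt template are placeholders, and this isn't the exact stack from the video, just one straightforward way to wire the two stages together.

```python
# A rough speech-to-image pipeline: transcribe audio, then illustrate it.
# Assumes `pip install openai-whisper diffusers transformers torch` and a GPU.
import torch
import whisper
from diffusers import StableDiffusionPipeline

# 1) Speech-to-text: transcribe the session audio with Whisper large-v3.
stt = whisper.load_model("large-v3")
transcript = stt.transcribe("dnd_session.wav")["text"]  # placeholder filename

# 2) Text-to-image: turn the transcript into an illustration prompt.
#    Stable Diffusion prompts are short, so we only keep the first ~300 chars.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
prompt = f"fantasy tabletop illustration of: {transcript[:300]}"

image = pipe(prompt).images[0]
image.save("scene.png")
```

In practice you'd chunk the audio scene by scene and generate one image per chunk, but the core hand-off (transcript out of one model, prompt into the next) is all there is to it.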

🧑‍🔬 Beyond Gemini: Pushing the boundaries of multimodal research

Google’s Gemini paper - Of course, in order to go beyond Gemini, we must understand what Gemini is. Click the link or the image above to read the Gemini paper in its entirety.

Multimodal biomedical AI - Through the lens of multimodality, this paper explores opportunities in personalized medicine, digital clinical trials, remote monitoring and care, pandemic surveillance, digital twin technology and virtual health assistants.

Measuring self-regulated learning and the role of AI: Five years of research using multimodal multichannel data - This paper reveals just how far we’ve come with respect to multimodal data. It introduces a “self-regulated learning processes, multimodal data, and analysis (SMA) grid” and maps the authors’ joint and individual research (63 papers) from the last five years onto that grid.

🎥 Exposing Gemini bit by bit: Technical Report Breakdown

🐝 Social media buzz

In other news, Karpathy said “There's too much happening right now, so here's just a bunch of links,” and every link is absolutely golden. Check it out!

From audio to robot commands, here’s multimodality in action at the AI Summit in New York. A robot dog does a backflip!

… And of course, here’s one more Gemini Tweet for you, just so you can see (in precise numbers) exactly how much AI buzz is swirling around social media.

📲 Apps to Check out

Integrating AI and LLMs into your application?

Built on a serverless, cloud-native platform, Graphlit automates complex data workflows, including data ingestion, knowledge extraction, LLM conversations, semantic search, alerting, and webhook integrations. Whew, that’s a mouthful.

Check out this recent blog post on how Graphlit, in collaboration with MapScaping, Deepgram, and OpenAI GPT-4 Turbo, revolutionizes content workflows for podcasters.

AI Apps Feature here

Heyday is an AI-powered memory assistant that resurfaces content you forgot about. It saves all your articles, documents, emails, and transcribed meetings in one place. Then it goes one step further: it resurfaces them when you need them.

Professional coaches use Heyday to create meeting notes, glean ideas from research, and surface insights from their conversations with clients. 

AI Apps Feature here

🤖 Additional bits and bytes

  • Is Mamba the new Transformer? Well, check out this paper to find out! In it, the researchers identify that a key weakness of existing subquadratic architectures (such as state-space models) is their inability to perform content-based reasoning, and they make several improvements to address it.

  • Animate Anyone. We’ll let the demo speak for itself on this one 😉

  • If you liked the multimodal approach to medicine above, check out this paper on using multimodal AI to detect cardiovascular diseases!