AI Minds #1: LLM Benchmarks Set the Bar

Like Standardized Tests, But For AI.

Welcome to AI Minds, a newsletter about the brainy, and sometimes zany, world of AI, brought to you by the Deepgram editorial team.

In each edition, we’ll highlight a special topic from the rapidly evolving field of Language AI. We’ll also touch on timely issues that the AI community is buzzing about. As a heads up, you might want to brace yourself for the occasional spicy take or two along the way.

Thanks for letting us crash your inbox; let’s party. 🎉

🔎 Benchmarks: Like Standardized Tests, But For AI

You remember that horrific running exercise for evaluating physical vigor, the FitnessGram Pacer Test? Well, an equivalent of that horrendous exercise has now been developed for AI and LLMs, to make sure they're performing up to par. Take a moment to consider what repercussions low-performing GPTs and lagging LLaMAs could have on society at their current rate of adoption; if you're sane, you might be cringing. Thankfully, some major AI players have developed benchmarks, or standardized fitness tests, to keep performance reliable and at its peak, and we've researched some of them for your consumption.

  • The ARC Benchmark: Like Middle School, but Not Actually - It’s not a benchmark containing two of each animal on planet Earth; instead, it’s a dataset full of questions you would have seen in middle school on one of those state-mandated exams. ARC is currently one of the best benchmarks for testing an LLM’s abstraction and reasoning capabilities.

  • HellaSwag: It’s not Rocket Science, It’s Commonsense - Unlike the Shania Twain classic, both Rocket Science and Commonsense do impress us. In Brad’s exploration of benchmarks, he covers HellaSwag (one of the benchmarks behind HuggingFace’s Open LLM Leaderboard), a ridiculously cumbersome acronym for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations, a benchmark for AI commonsense reasoning. There’s a short scoring sketch right after this list if you want to see what running it looks like.

  • TruthfulQA: Fact or Fiction? - The truth? You can’t handle the truth, and sometimes, that rings true for LLMs, which is why TruthfulQA focuses on combating failures in truthfulness that LLMs aren’t likely to overcome just by scaling up. By reducing these AI hallucinations, models become more trustworthy and reliable.

  • MMLU: The True SAT of LLM Benchmarks - Did someone throw a bowl of alphabet soup at the wall? What the heck is MMLU? Actually, no soup was harmed in the making of this newsletter. MMLU stands for Massive Multitask Language Understanding, and it’s intended to measure the knowledge a model acquired during pre-training.

  • SuperGLUE: Not Gorilla Glue - Remember when that lady used Gorilla Glue on her head instead of hairspray? Obviously, this is not that. Follow along with Andy as he dives deep into GLUE and its successor, SuperGLUE, which aim to offer a single-number metric that quantifies the performance of a language model across different types of understanding tasks.
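
If you’re curious what “running a model through a benchmark” actually looks like, here’s a minimal, back-of-the-envelope sketch of multiple-choice scoring on HellaSwag with the HuggingFace datasets and transformers libraries. It assumes the dataset’s usual `ctx` / `endings` / `label` fields and uses GPT-2 as a stand-in model; it’s illustrative only, not how any official leaderboard computes its numbers.

```python
# A minimal sketch of multiple-choice scoring on HellaSwag, assuming the
# dataset's standard `ctx` / `endings` / `label` fields and a small causal LM.
# Illustrative only, not how any official leaderboard computes scores.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; swap in whatever LLM you want to test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def ending_logprob(context: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to `ending` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Score only the ending tokens, each predicted from the previous position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    ending_positions = range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, full_ids[0, pos + 1]].item() for pos in ending_positions)

correct = 0
examples = load_dataset("hellaswag", split="validation").select(range(50))
for ex in examples:
    scores = [ending_logprob(ex["ctx"], ending) for ending in ex["endings"]]
    correct += int(scores.index(max(scores)) == int(ex["label"]))

print(f"Accuracy on {len(examples)} examples: {correct / len(examples):.2%}")
```

The trick is the standard log-likelihood comparison: score each candidate ending under the model and count the example correct when the highest-scoring ending matches the label.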

Be sure to keep your eyes peeled for more Deepgram analyses of AI benchmarks.

💡 Abstract Insights

In each issue of AI Minds, we share a couple of abstracts from academic papers we found particularly interesting. Here’s what the research world is reading right now.

Besta, Maciej, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, et al. “Graph of Thoughts: Solving Elaborate Problems with Large Language Models.” arXiv. https://doi.org/10.48550/arXiv.2308.09687.

We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks.
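
If the “arbitrary graph” framing is hard to picture, here’s a toy sketch of the core data structure (ours, not the authors’ implementation): thoughts as vertices, with edges recording which earlier thoughts each new one was derived from. The `generate` function below is a hypothetical stand-in for a real LLM call.

```python
# A toy sketch of the Graph-of-Thoughts data structure (not the authors' code):
# "thoughts" are vertices, and parent links record dependencies between them.
from dataclasses import dataclass, field

@dataclass
class Thought:
    text: str                                      # the LLM output this vertex represents
    parents: list = field(default_factory=list)    # thoughts it was derived from

def generate(prompt: str, parents=None) -> Thought:
    # Placeholder for an actual LLM call; here we just echo the prompt.
    return Thought(text=f"LLM({prompt})", parents=parents or [])

# Branch: explore two partial solutions from the same problem statement.
problem = generate("sort [5, 3, 8, 1]")
branch_a = generate("sort the first half", parents=[problem])
branch_b = generate("sort the second half", parents=[problem])

# Aggregate: merge several thoughts into one vertex, which a tree cannot express.
merged = generate("merge the two sorted halves", parents=[branch_a, branch_b])

# Refine: a feedback loop, i.e. a new vertex that improves on an earlier thought.
refined = generate("double-check the merged list", parents=[merged])
```

The aggregation step, where one thought has multiple parents, is the part that distinguishes a graph from the chains and trees of earlier prompting schemes.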

Pal, Arka, Deep Karkhanis, Manley Roberts, Samuel Dooley, Arvind Sundararajan, and Siddartha Naidu. “Giraffe: Adventures in Expanding Context Lengths in LLMs.” arXiv. https://doi.org/10.48550/arXiv.2308.10882.

Modern large language models (LLMs) that rely on attention mechanisms are typically trained with fixed context lengths which enforce upper limits on the length of input sequences that they can handle at evaluation time. To use these models on sequences longer than the train-time context length, one might employ techniques from the growing family of context length extrapolation methods -- most of which focus on modifying the system of positional encodings used in the attention mechanism to indicate where tokens or activations are located in the input sequence. We conduct a wide survey of existing methods of context length extrapolation on a base LLaMA or LLaMA 2 model, and introduce some of our own design as well -- in particular, a new truncation strategy for modifying the basis for the position encoding.

We test these methods using three new evaluation tasks (FreeFormQA, AlteredNumericQA, and LongChat-Lines) as well as perplexity, which we find to be less fine-grained as a measure of long context performance of LLMs. We release the three tasks publicly as datasets on HuggingFace. We discover that linear scaling is the best method for extending context length, and show that further gains can be achieved by using longer scales at evaluation time. We also discover promising extrapolation capabilities in the truncated basis. To support further research in this area, we release three new 13B parameter long-context models which we call Giraffe: 4k and 16k context models trained from base LLaMA-13B, and a 32k context model trained from base LLaMA2-13B. We also release the code to replicate our results.
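
The headline result, linear scaling of positions, is easy to sketch. Here’s a toy illustration of that idea applied to rotary position embeddings (the positional scheme LLaMA uses): divide positions by a scale factor so a longer input maps back into the position range the model saw during training. This is our sketch, not the Giraffe code.

```python
# A toy sketch of linear position scaling for rotary embeddings (RoPE);
# not the authors' code, just the "linear scaling" idea from the abstract.
import torch

def rope_angles(seq_len: int, head_dim: int, scale: float = 1.0, base: float = 10000.0):
    """Rotation angles for each (position, dimension-pair).

    With scale > 1, positions are compressed back into the range the model
    saw during training, e.g. scale=4 lets a 2k-trained model address 8k tokens.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale   # the linear interpolation step
    return torch.outer(positions, inv_freq)             # shape: (seq_len, head_dim // 2)

# Angles for an 8192-token sequence squeezed into a 2048-position training range.
angles = rope_angles(seq_len=8192, head_dim=128, scale=4.0)
```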

🗞️ Other Bits and Bytes

The week is just getting started, which means there’s still plenty of time to catch up on the news and links the AI community is talking about.

  • ⛏ Behind the rise and rise of AI’s ultimate picks-and-shovels company. How Nvidia Built a Competitive Moat Around A.I. Chips (Don Clark in the New York Times)

  • 💾 More than you ever wanted to know about Nvidia’s hottest AI processors. Nvidia H100 GPUs: Supply and Demand (GPU Utils)

  • 🃏 Making model cards digestible. AI Nutrition Facts, a project from Twilio, aims to boil down AI models to an FDA-style label. Twilio invites AI researchers to make a “nutrition label” for whatever models they are cooking up next.

  • 🍜 One to noodle on. AI and the Structure of Reasoning (Venture investor Jerry Neumann on his blog, Reaction Wheel) — “This post argues that there may someday be an AI that can improve itself beyond any bounds humans can imagine, but generative AI isn’t it. The current AI technology is not only not smarter than humans, it can’t improve itself to be smarter than humans. My argument relies on fundamental conceptual limitations on types of reasoning. This is important not just to bound what current AI can do, but to expand our thinking to what would need to be done to create true AGI.”

  • 🥵 A spicy take. AI Isn’t Good Enough (Paul Kedrosky and Eric Norlin writing in Irregular Ideas, a newsletter from SK Ventures) — “The trouble is—not to put too fine a point on it—current-generation AI is mostly crap. Sure, it is terrific at using its statistical models to come up with textual passages that read better than the average human’s writing, but that’s not a particularly high hurdle. Most humans are terrible writers and have no interest in getting better. Similarly, current LLM-based AI is very good at comparing input text to rules-based models, impairing the livelihood of cascading stylesheet pedants who mostly shouted at people (fine, at Paul) on StackExchange and Reddit. Now you can just ask LLMs to write that code for you or check crap code you’ve created yourself.”