Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.
In each edition, we’ll highlight a special topic from the rapidly-evolving field of Language AI. We’ll also touch on timely issues the AI community is buzzing about. As a heads-up, you might want to brace yourself for the occasional spicy take or two along the way.
Thanks for letting us crash your inbox; let’s party. 🎉
While large language models are all the rage, the topic of smaller language models and their benefits is gaining momentum. Let’s kick off the conversation with two articles we recently wrote on the topic:
Why Bigger Isn’t Better for Language Models, also by Xian “Andy” Wang — Learn how the high costs associated with large models like GPT-4 might be driving users towards more affordable options without compromising on performance.
🧩 Unpacking the Language Model Puzzle
In the world of language AI, the trade-offs between bigger and smaller language models usually boil down to two words: Performance and Efficiency. Both contain multitudes, so let’s unspool ‘em a bit. 🧵
🏎️ LM Performance
To borrow a bit of technical jargon, what does it mean for a language model to be performant? To say the least, it’s multifaceted.
Accuracy. Quite simply, how often the model gets it right, keeping in mind that “right” depends on the context. See also from AI Minds #1: Performance on language model benchmarks.
Speed (Latency). How quickly the model produces an answer.
Robustness. The model’s stability under unpredictable conditions, such as dealing with noisy input data or even adversarial attack.
Generalization. How well the model can apply its “learning” to new, unseen data.
Throughput. How many queries the model can process in a given period of time.
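For the two metrics above that come down to a stopwatch, latency and throughput, here is a minimal sketch of how one might measure them. Everything in it is an assumption for illustration: `fake_model` is a hypothetical stand-in that just reverses its input, and `benchmark` is our own helper, not part of any library. Swap in a real model call to get numbers that mean something.

```python
import statistics
import time


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only reverses the prompt."""
    return prompt[::-1]


def benchmark(model, prompts):
    """Time each call and return (median latency in seconds, queries per second)."""
    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        model(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return statistics.median(latencies), len(prompts) / elapsed


median_latency, throughput = benchmark(fake_model, ["hello world"] * 1000)
print(f"median latency: {median_latency * 1e6:.1f} microseconds")
print(f"throughput: {throughput:,.0f} queries per second")
```

Note that this loop sends one query at a time. A server that batches requests can raise throughput without making any single answer arrive sooner, which is why the two numbers are worth tracking separately.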
📈 LM Efficiency
When it comes to evaluating language model efficiency, suffice it to say that measures are a little more multidimensional.
Computational Costs. Recall that there are two key phases of the machine learning model life cycle: training and inference.
Hardware Requirements. Broadly speaking, there are three categories of hardware requirements to consider. Keep in mind that “efficiency” in this context is the extent to which less hardware can achieve high performance.
Compute. Usually in the form of graphics processing units (GPUs), though it can also include specialized hardware like TPUs. CPUs can also factor into certain training and inference situations.
Memory Footprint. The amount of RAM required during training (due to factors like batch size) and inference.
Storage. The amount of disk space required to store the model’s weights and architecture, as well as storage requirements for iteratively saving model states (i.e., “checkpoints”) during training.
Training Duration. The time it takes between inception of a model and when it’s deemed ready to deploy to production. This can consist of two steps: pre-training on the base dataset, and fine-tuning the model on a smaller task-specific subset of data.
Data Requirements. How much data is required to train the model? This can vary widely depending on a number of variables, ranging from the size of the model to the task- or domain-specificity of the fine-tuning step.
Power Consumption. How much electricity does it take to train or run the model in question? It depends. Power consumption is a factor in Computational Costs, but this metric is vitally important when talking about inference on the edge, such as local image generation or voice transcription functions on smartphones or laptop computers.
Carbon Footprint. The environmental cost, expressed in CO2 equivalent emissions, incurred as a result of training and deploying the model in question. This is largely a function of the blend of electricity sources used to power training and inference. For example: A data center powered by hydroelectric or nuclear power will build and run models with a smaller carbon footprint than a comparable data center hooked up to a coal-fired power plant.
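Two of the measures above lend themselves to back-of-the-envelope arithmetic, so here is a rough sketch. It assumes weight storage is simply parameter count times bytes per parameter (2 bytes for 16-bit precision), and that emissions are energy used times the carbon intensity of the grid. The function names, the 7-billion-parameter model, the 1,000 kWh job, and both grid intensities are made up for illustration, not measurements of any real model or data center.

```python
def weight_storage_gb(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


def emissions_kg_co2e(energy_kwh: float, grid_kg_co2e_per_kwh: float) -> float:
    """Energy used multiplied by the carbon intensity of the electricity source."""
    return energy_kwh * grid_kg_co2e_per_kwh


# A hypothetical 7-billion-parameter model stored at 16-bit precision.
print(weight_storage_gb(7_000_000_000))  # 14.0

# The same hypothetical 1,000 kWh job on two made-up grids.
print(emissions_kg_co2e(1_000, 0.02))  # 20.0 (low-carbon grid)
print(emissions_kg_co2e(1_000, 0.8))   # 800.0 (carbon-heavy grid)
```

The storage figure covers weights only. Memory during training runs higher because of factors like batch size, so treat it as a floor rather than a budget.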
🔮 Does Fortune Favor the Small (Fry)?
When it comes to the current application landscape, the short answer is, more often than not, yes. Sure, larger language models are typically more capable, but in real-world implementations, that capability comes at a cost, which hearkens back to the trade-off between performance and efficiency highlighted above.
So, what paradigm wins? The accurate if dissatisfying short answer is: it depends. If you’re a student or programmer just faffing around to see if GPT-4 can take some work off your plate on a one-off basis, by all means, use the largest language model you can summon. But if you’re building for a close-to-real-time use case or one requiring high accuracy for an incredibly niche task, a smaller model may be tailored to fit the purpose.
🍔 In other words, one doesn’t need to deploy a GPT-4-sized language model to handle ordering for a hamburger restaurant, but if you’re looking for help to develop an ontological understanding of what even is a hamburger, you’ll get a better analysis out of a larger language model than a smaller one. How do we know? We tested GPT-3.5-Turbo against GPT-4. (See slightly truncated screenshots below, both generated by the September 25th version of each model.)
Here’s GPT-3.5-Turbo’s reply:
David Models 🐇 vs. Goliath Models 🦣: Which One Wins?
Speaking of ontology, before the brief fast food interlude, we roughly mapped the playing field of Performance and Efficiency for language models. How do small versus large language models stack up?
Spoiler alert: smaller language models tend to win out when it comes to performance and efficiency in deployment environments.
🐇 Advantages of Smaller Language Models:
Speed and Efficiency: They're faster for inference, making them suitable for real-time applications.
Deployment: Easier to deploy on resource-constrained environments like mobile devices or edge devices.
Reduced Computational and Storage Needs: They consume less power and memory.
Economic: Training and deploying smaller models are generally cheaper in terms of computational and energy costs.
Less Prone to Overfitting: On smaller datasets, smaller models might generalize better.
Interpretability: They can be easier to analyze and interpret compared to their larger counterparts.
🦣 Advantages of Larger Language Models:
Performance: They often outperform smaller models on a wide range of tasks, capturing subtle nuances and intricate patterns.
Knowledge Retention: They can retain vast amounts of general knowledge while still being fine-tuned for specific tasks.
Robustness: Better generalization across diverse tasks and datasets.
Richer Representations: They can develop more profound and more nuanced embeddings for data, leading to better semantic understanding.
Few-Shot or Zero-Shot Learning: Larger models have demonstrated the ability to generalize from very few examples or even without any explicit examples of certain tasks.
An abstract is basically the TLDR version of an academic paper, which is handy since (1) many academic papers are being published about language models these days, and (2) most papers go bushwhacking through technical minutiae and a lot of math… which is to say that they can be in the weeds.
This time around, we’ve got not one but three abstracts to share for those of you who want to do some more intellectual spelunking through the growing literature covering the mightiness of smaller language models.
In alphabetical order by author:
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
⚡ Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. “Scaling Laws for Neural Language Models.” arXiv, January 22, 2020. https://doi.org/10.48550/arXiv.2001.08361.
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much "greener" in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.
A Slight Continuation of the Above. Back in May, New York Times contributor Oliver Whang covered the Race to Make A.I. Smaller (and Smarter) by way of the BabyLM Challenge. It (i.e., the challenge) invites a shift from colossal language models to compact, efficient ones, fostering a more inclusive and human-compatible AI landscape. Explore how this project could reshape the narrative around language model development and its implications for understanding human language learning.
Mapping the Generative AI Market. In a follow-up to last year’s look at the creative new world of Generative AI, Sequoia Capital partners Sonya Huang and Pat Grady linked up with GPT-4 to co-author Generative AI’s Act Two. The TLDR-est of TLDRs: if Act 1 was “technology-out,” then Act 2 is “customer-back.” And for those interested in seeing how the Sand Hill Road stalwart segments the market, the article has some handy market maps too.
The Longer Llama that Could. Meta unveils Llama 2 Long, a refined AI model excelling in managing lengthy user prompts. With a subtle yet crucial tweak and enriched training data (e.g. an additional 400 billion-token fine-tuning dataset), this new model outperforms competitors GPT-3.5-Turbo and Claude 2 on many tasks, reaffirming Meta's commitment to more open (but not quite open-source) AI advancements.