AI Minds #2: The "Opening" of AI

What does "open source" mean for AI? The answer: tbh, TBD.

Welcome (back) to AI Minds, a newsletter about the brainy and sometimes zany world of AI, brought to you by the Deepgram editorial team.

In each edition, we’ll highlight a special topic from the rapidly-evolving field of Language AI. We’ll also touch on timely issues the AI community is buzzing about. As a heads-up, you might want to brace yourself for the occasional spicy take or two along the way.

Thanks for letting us crash your inbox; let’s party. 🎉

🎁 The Opening of AI

This time around, we’re looking at the world of “Open” AI (and, no, it’s not the OpenAI you’re thinking of). Some language model providers prefer to keep things close to their collective chest, while others opt to flaunt almost everything under the aegis of “Open Source.”

🦙 Quick Looks at Llama

Meta’s Llama model family is not the first set of large language models (LLMs) that had the AI community commending its contribution to “open-source” models. But it (specifically the permissively-licensed Llama 2 model) did make a big splash, and it kicked off a broader debate about what openness means in the AI era. To be clear, it’s not “open source” according to the official Open Source Definition, but it is permissively licensed, and Llama nonetheless raised the bar when it came to performance and capabilities of open models. And for that reason alone, it’s worth a closer look.

Here at Deepgram, we’ve taken a couple swings at the Llama piñata, and here’s the analysis we shook out:

🤔 What’s So Complicated About “Open” AI

Much like with open-source software, open models present a number of advantages over proprietary, closed models like ChatGPT, Claude, Bard, et al. Open models offer a degree of transparency, customizability, and access to the underlying technology that closed models simply cannot.

All that being said, this is a tricky topic to tackle, taking into account that:

  1. There’s a long-established community around Open Source Software with its own cultural norms and expectations for evaluating anything that purports to be “Open Source.”

  2. There’s some heated debate about what “Open Source” even means in the context of AI models, and the definition of “Open Source AI” is quite literally in the process of being written (but more on that in a bit).

If all this sounds complicated, that’s because it is.

The definition of “Open Source Software” is difficult to apply to AI, but why is that? Plainly stated: it’s because, when it comes to AI, there are more moving parts to consider.

To unpack this a bit, let’s compare what we’ll refer to as “Classic Open Source Software” vis-à-vis “Open Source AI Models”.

👴 Old-School Open Source

In the traditional sense, open-source software is what it says on the can: a published, widely available software package whose codebase can be reviewed and validated by anyone in possession of it. There is one thing to share, and that thing is the codebase. 

Take the extremely popular Python package Pandas as an example. You are just one pip install away from using one of the most widely used data analysis toolkits and gaining full read access to its entire codebase, for free, which you can analyze, screen for security issues, and even contribute back to if you’ve got the programming chops to do so.
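To make "full read access" concrete, here is a minimal sketch of how you might peek at the source of any installed Python package. The helper names (`source_location`, `first_source_lines`) are our own inventions for illustration, and the demo uses the standard-library `json` module so it runs without installing anything. After a `pip install pandas`, passing `"pandas"` instead should work the same way.

```python
import importlib
import inspect
from typing import List, Optional


def source_location(module_name: str) -> Optional[str]:
    """Return the path of the file that defines an installed module, if any."""
    module = importlib.import_module(module_name)
    return inspect.getsourcefile(module)


def first_source_lines(module_name: str, count: int = 5) -> List[str]:
    """Return the first `count` lines of a module's source code."""
    module = importlib.import_module(module_name)
    return inspect.getsource(module).splitlines()[:count]


if __name__ == "__main__":
    # "json" ships with Python; swap in "pandas" once it is installed.
    print(source_location("json"))
    for line in first_source_lines("json", count=3):
        print(line)
```

Nothing here is special to any one package: if the code is installed on your machine, you can read it.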

🔓 Open-Source AI Models

By contrast, for an AI model to be considered truly “open source,” at least four things must be shared:

  • The code to build the model (aka “training code”)

  • The data used to build the model (aka “training data”)

  • The model weights (aka “the model”)

  • The code used to run the model (aka “inference code”)

For the record, this is just the current consensus. A more formal definition is being developed, but more on that in a bit.
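If it helps to think of the four-part checklist as a data structure, here is one rough way to sketch it. Everything in this snippet (the `ModelRelease` class, its field names, and the example release) is hypothetical and only mirrors the informal consensus described above, not any official definition.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ModelRelease:
    """Which of the four artifacts a (hypothetical) model release publishes."""
    name: str
    training_code: bool
    training_data: bool
    model_weights: bool
    inference_code: bool


def missing_components(release: ModelRelease) -> List[str]:
    """List the artifacts that the release does not share."""
    return [
        f.name
        for f in fields(release)
        if f.name != "name" and not getattr(release, f.name)
    ]


def is_fully_open(release: ModelRelease) -> bool:
    """True only when all four artifacts are shared."""
    return not missing_components(release)


if __name__ == "__main__":
    # An invented weights-plus-inference release, for illustration only.
    example = ModelRelease(
        name="example-model",
        training_code=False,
        training_data=False,
        model_weights=True,
        inference_code=True,
    )
    print(missing_components(example))
    print(is_fully_open(example))
```

A release that ships only weights and inference code would come back with two missing components, which is the gap much of the debate below is about.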

💥 Breaking Open the Debate

In software land, Open Source means something very specific. Let’s peek at an abbreviated definition from the Open Source Initiative, which comes in 10 helpful parts.

Generally speaking, “open” foundation models are pretty good at adhering to some of these clauses, but other clauses in the Open Source Software Definition don’t align with how some foundation model developers release their models. Let’s take a look.

👯‍♀️ Where Open Models Generally Conform to “Open Source”

🚛 Free Redistribution. The license can’t restrict any party from selling or giving away the software program.

  • If a company wants to use Llama 2 (for example) as part of a software product, they’re allowed to do so under the community license. (With a notable exception we’ll mention later.)

👷 Derived Works. The license must allow for modifications and derivations of the work.

  • Most open models allow developers to fork and customize a model with additional training, fine-tuning, or other modifications.

🛒 License Must Not Be Specific to a Product. The rights of the license must not depend on the program being part of a specific software distribution.

  • This one is kind of self-explanatory. But, to again use Llama 2 as an example, Meta cannot stipulate that the model can only be used in, say, a digital assistant but not a tutoring app for students. Provided that the terms of use aren’t violated, the model can be used in whichever way developers figure out how to use it.

🪪 Distribution of License. The rights of the license must apply to all parties without the need for additional licenses.

  • If a developer builds an application with an open model, they’ve agreed to the license for that model. Any future user of the application or model would not, in turn, need to obtain permission from the open model’s creator to use or modify the model.

🌐 License Must Not Restrict Other Software. The license must not impose restrictions on other software distributed along with the licensed software.

  • In other words, if a developer uses an open language model as part of their application, they cannot say that all other applications distributed through the same platform have to have the same license as their application.

  • A real-world (but hypothetical) example: The developers of an image generation application for iOS can choose an open model like Stable Diffusion, but the Stable Diffusion license cannot (and does not) state that another application using a closed model like OpenAI’s DALL-E model should be barred from the App Store.

🖥 License Must Be Technology-Neutral. The license must not be dependent on a specific technology or interface style.

  • Example: The license for an open language model can’t require that the language model can only be used in a chat- or SMS-style interface.

🎻 Where Open Models Tend to Deviate from Classic “Open Source” Software

🧑‍💻 Source Code. Source code for the program shall be included and available for review by the user or licensee of the program.

  • Unlike other types of software, the code behind a foundation model is only part of what can possibly be shared. Sure, there’s training and inference code, but there’s also the underlying training data and the resulting model weights. Many “open” foundation models do not publish all of these: the code, the model weights, and the training data.

💪 Integrity of The Author’s Source Code. The license must allow the distribution of modified source code and permit software built from modified source code to be distributed.

  • Again, source code is not always shared openly, and many AI software licenses place some limitations on the distribution of code, either by explicitly forbidding it or by limiting use to a subset of potential users (ex. “for research use only”).

  • Code and model weights are sometimes available only to those who apply for access to them, limiting free distribution.

🧑‍🤝‍🧑 No Discrimination Against Persons or Groups. The license must not discriminate against any person or group of persons.

  • Some AI software licenses exclude certain types of users. A favorite example is Llama 2, from Meta. Sure, it may be free to use (including for commercial purposes) for just about everyone, but not for companies with more than 700 million monthly active users. Those companies (like Snap, for example) would have to apply and presumably pay for a Llama license.
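Boiled down to code, the user-count carve-out described above is a single comparison. This is only a sketch of the rule as summarized here, not a reading of the license text; the constant and function names are hypothetical.

```python
# Threshold as described above: more than 700 million monthly active users.
MAU_THRESHOLD = 700_000_000


def needs_separate_license(monthly_active_users: int) -> bool:
    """Return True if a company exceeds the threshold described above."""
    return monthly_active_users > MAU_THRESHOLD


if __name__ == "__main__":
    print(needs_separate_license(5_000_000))
    print(needs_separate_license(750_000_000))
```

Simple as the check is, any rule that sorts users into "allowed" and "ask first" is exactly what this clause of the Open Source Definition rules out.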

🌾 No Discrimination Against Fields of Endeavor. The license must not restrict the use of the program in any specific field of endeavor.

  • Due to the inherent risks of AI, many providers of open foundation models include acceptable use policies. Most of these stipulations are not too controversial, like forbidding the use of generative AI outputs to promote self-harm or foment political unrest. But they are stipulations nonetheless.

🌶️ Takes and Takeaways

  • AI Weights Are Not Open "Source" (Sid Sijbrandij publishing on the blog of Open Core Ventures) — 🥡 The Takeaway: AI weights should not be called "open source" because they are not source code. The author suggests the terms "Open Weights" for unrestricted usage and "Ethical Weights" for licenses with usage restrictions.

  • The Myth of ‘Open Source’ AI (Will Knight for Wired) — 🥡 The Takeaway: While models like Llama 2 may be branded as "open," they still come with significant restrictions that benefit big tech companies, hindering the democratization and accessibility of AI.

  • How Open-Source Software Shapes AI Policy (Alex Engler for The Brookings Institution) — 🥡 The Takeaway: Given that this article was written in 2021, its point was prescient. Open-source software plays a central role in the development and use of AI, but its importance is often overlooked in AI policy discussions.

  • Open source is good for AI but, is AI good for open source? (Terence Eden publishing on the blog of the British Computer Society) — 🥡 The Takeaway: Open source is beneficial for AI development, but proper attribution and understanding of code are essential to avoid legal and reliability issues.

💡 Abstract Insights

Everyone loves a good academic paper, right? Right? Well, even if you don’t have time to read the whole thing, a lot can be gleaned from a paper’s abstract.

This week, we’d like to highlight one of the papers that served as the basis for the “acceptable use” policies that often come with many AI models today, both open- and closed-source.

Contractor, Danish, Daniel McDuff, Julia Haines, Jenny Lee, Christopher Hines, Brent Hecht, Nicholas Vincent, and Hanlin Li. “Behavioral Use Licensing for Responsible AI.” In 2022 ACM Conference on Fairness, Accountability, and Transparency, 778–88, 2022. https://doi.org/10.1145/3531146.3533143.

With the growing reliance on artificial intelligence (AI) for many different applications, the sharing of code, data, and models is important to ensure the replicability and democratization of scientific knowledge. Many high-profile academic publishing venues expect code and models to be submitted and released with papers. Furthermore, developers often want to release these assets to encourage development of technology that leverages their frameworks and services. A number of organizations have expressed concerns about the inappropriate or irresponsible use of AI and have proposed ethical guidelines around the application of such systems. While such guidelines can help set norms and shape policy, they are not easily enforceable. In this paper, we advocate the use of licensing to enable legally enforceable behavioral use conditions on software and code and provide several case studies that demonstrate the feasibility of behavioral use licensing. We envision how licensing may be implemented in accordance with existing responsible AI guidelines.

Contractor et al. (2022)

📖 Defining the Future of Open Models

Although there isn’t a clear-cut definition of “open-source” AI, that’s changing, and soon.

🌱 Early Inroads in AI-Specific Open Licensing

It should come as no surprise that all coauthors of the above-mentioned research paper are the founding members of the Responsible AI License (RAIL) initiative.

Established in 2019, the initiative created the RAIL and Open RAIL family of licenses. Open RAIL licenses are in many ways similar to other open-source software licenses, but with a twist: they append a set of use restrictions that the initiative states are meant to reduce AI risks.

Here’s the current lay of the Open RAIL land:

  • Open RAIL-D includes use restrictions on data.

    • Note: This license is still a work in progress. RAIL is hosting a workshop on October 15, 2023 to discuss data licensing.

  • Open RAIL-A includes use restrictions on the application. (Functionally, it’s an end-user license agreement.)

  • Open RAIL-M includes use restrictions on the model.

  • Open RAIL-S includes use restrictions on source code.

Open RAIL licenses have been used by research projects like BigScience’s BLOOM language model, and by commercial providers like Stability AI. Variants of the Open RAIL license are among the most-used licenses for models on Hugging Face.

🪨 Setting Something in Stone: All Eyes on October

The Open Source Initiative—the nonprofit standards body which sets and maintains licensing standards in the open-source software community—is in the midst of a Deep Dive into the question of what, exactly, “open source” means for artificial intelligence.

The process started back in June, but here’s the timeline over the next few months:

  • 📅 September 19-21. Community Review #3 takes place in Bilbao, Spain at the Linux Foundation Open Source Summit.

  • 📅 September 26-October 12. OSI will be hosting a series of webinars featuring speakers involved in the process of crafting a definition for “Open Source AI.” These folks include RAIL co-founder Daniel McDuff, Microsoft senior corporate counsel for Open Source, Standards, and Machine Learning Mary Hardy, and Creative Commons board member Luis Villa, among others. Signing up to view the webinars live is free.

  • 📅 October 17. At the All Things Open conference in Raleigh, NC, OSI will collect feedback and prepare a release candidate of the Open Source AI Definition.

Online comments will remain open, and the review process will continue into 2024.

In the same way that OSI sets the authoritative definition of open-source software, it’s hoped that its definition of Open Source AI will clarify what’s so far been more than a little nebulous. 😶‍🌫️