It's called openwashing, and it's an AI practice that can put you, your customers, and even AI itself at risk.
If you are one of the many developers and others who watch the field, you know that new generative AI models are released literally every day. Some are relatively small and aimed at a particular niche; others are more than deserving of the name Large Language Model and make the news. Some, like ChatGPT, Claude, and Midjourney, are clearly and proudly proprietary. Most, however, like Llama, Falcon, and Stable Diffusion, are labeled as "open" or "open source."
And why not? In the world of AI and machine learning, there are real advantages to an open model, including:
- Transparency: Open models offer clear insight into their development, enabling greater trust and accountability.
- Reproducibility: With full access to data, code, and documentation, results can be consistently replicated, so there's no question as to what's actually happening behind the scenes.
- Collaboration and Innovation: Open models encourage community-driven improvements and advancements as one model builds on another.
- Security and Governance: Community scrutiny helps identify and address vulnerabilities and biases, enhancing model reliability.
But when a model purports to be open and it isn't, that's openwashing. The Linux Foundation AI & Data Generative AI Commons has developed the Model Openness Framework to help you avoid it, but before we get into that, let's talk about openwashing and why it's such a problem.
What is openwashing?
In my role leading the Models and Data workstream of the Linux Foundation AI & Data Generative AI Commons, I spend a lot of time in meetings with MIT's Irving Wladawsky-Berger. Irving has been around forever and often regales us with tales of what it was like at the beginning of the open-source era, when companies that made software tried to demonize open source as dangerous and a security risk.
These days, of course, we know that's ridiculous; in fact, if it weren't for open source, you probably wouldn't be able to read this article. Open source is everywhere, from the servers that run the internet to the databases and programming languages used to create websites. It has become a legitimate and widely accepted way of distributing and consuming software, and its crucial role in promoting innovation, transparency, and collaboration is impossible to dispute.
Sometimes software appears to be open source, in that it's freely available – but it's not. For software to be open source, you need to be able to take it, modify it, and do what you like with it. Sometimes you can't do that because the source isn't available, or it's available but licensing curtails what you can do.
Just as companies that make misleading claims about the environmental benefits of their products are said to be "greenwashing," companies that claim their software is open source when it's not are "openwashing."
How Openwashing Happens to AI Models
When it comes to AI models, there are even more opportunities for openwashing than with software. In some cases, the model producers themselves don't even know that they're openwashing.
Completeness
Typically, openwashing starts when only selected components of an AI model are released. For example, a producer will release the model architecture -- the actual code that makes it work -- but crucial elements, such as the complete datasets used for training and in-depth documentation, may not be considered "necessary," or may be released somewhere that isn't immediately obvious.
When not only the training data but also the code used to preprocess that data and evaluate the model is wholly or partially inaccessible, that model is incomplete. Not only is it impossible to verify the contents of the data, it's also impossible to duplicate the researchers' results. As in any kind of scientific research, results can only be considered legitimate when they are fully repeatable.
Openness
A common openwashing tactic involves using licensing terms that superficially suggest openness. However, these licenses may contain restrictive clauses that prevent users from modifying, sharing, or using the model in various ways, such as in commercial projects. These kinds of restrictions don't just run counter to open-source principles; they directly prevent the kind of innovation and advancement those principles were meant to foster.
Fortunately, in many, or even most, cases this kind of openwashing isn't deliberate. Many model producers, accustomed to open-source software, genuinely believe they're doing the right thing. But even with the best of intentions, meeting those two standards -- completeness and openness -- is not, by itself, enough for AI models.
Why "Open Source" Doesn't Work for AI Models
The concept of open source faces unique challenges when applied to AI models, which can be broken down into several key issues:
- Data sets aren't software: While the architecture of a model consists of software, and is thus appropriately covered by open-source licenses such as Apache 2.0, the data sets that underlie that model -- as well as the model weights that determine its behavior -- are not. That means producers aren't just making a mistake when they apply that same Apache 2.0 license to those kinds of components; they're creating a legal black hole into which model consumers can easily fall. Instead, these components should be labeled with licenses designed for content, such as Creative Commons Attribution Share Alike (CC-BY-SA), or even data-specific licenses such as CDLA-Permissive-2.0 (see the sketch after this list).
- Data Dependencies: AI models are often dependent on proprietary or confidential datasets that cannot be shared openly due to privacy concerns or intellectual property rights. This restricts the ability to fully open-source the entire model, as critical components remain inaccessible.
- Complex Integration Needs: Unlike most software applications, AI models frequently need to be integrated into existing systems with specific configurations. This integration can be complex and requires careful tuning, which is difficult to manage in a fully open-source framework without extensive documentation and support that goes beyond the code itself. Often, this information comes in the form of a research paper that explains exactly how the model producer got their results and provides enough information to replicate the effort.
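To make the licensing point concrete, here's a minimal sketch of a per-component license manifest for a hypothetical model release, written in Python. The component names and license choices are illustrative assumptions, not recommendations for any particular project:

```python
# A hypothetical per-component license manifest for a model release.
# Component names and license identifiers are illustrative only; the
# point is that code, weights, and data each need an appropriate license.
MODEL_MANIFEST = {
    "model_architecture": {"artifact": "modeling.py", "license": "Apache-2.0"},
    "training_code": {"artifact": "train.py", "license": "Apache-2.0"},
    "model_weights": {"artifact": "weights.safetensors", "license": "CDLA-Permissive-2.0"},
    "training_dataset": {"artifact": "data/corpus.jsonl", "license": "CC-BY-SA-4.0"},
    "technical_report": {"artifact": "report.pdf", "license": "CC-BY-4.0"},
}

# Licenses written for software, not for content or data.
SOFTWARE_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}

def flag_license_mismatches(manifest):
    """Warn when a non-software component carries a software-only license."""
    for name, info in manifest.items():
        is_software = name.endswith("_code") or name == "model_architecture"
        if not is_software and info["license"] in SOFTWARE_LICENSES:
            print(f"WARNING: {name} carries software license {info['license']}")

flag_license_mismatches(MODEL_MANIFEST)  # silent for the manifest above
```

The specific licenses will vary from release to release, but the structural point stands: a single blanket software license slapped across all of these components is a red flag.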
Even so, in our conversations with model producers, we've found that they often don't realize there's even an issue. For example, they may know that they need to produce a research paper or technical report to explain what they've done, but when it comes to the training data, they're satisfied that it simply exists somewhere. The producer may be satisfied, but that fails the essential scientific principle of repeatability.
The Worst of Both Worlds
Unfortunately, openwashed models present the same problems as closed models, with none of the benefits of being open. Those problems include:
- Questionable Data: Unverified data used to create a model may be biased, may contain private information, or may include unlicensed or copyrighted material, any of which can lead to model output that is harmful or even legally questionable.
- Dubious Model Evaluation: Without complete transparency, the methods used to evaluate these models often lack rigor and can be biased toward presenting favorable results. This makes it difficult for users to trust the accuracy and effectiveness of the model, as independent verification of performance claims is not possible.
- Generation of Disinformation or Illegal Content: Without understanding how a model works and the information on which it is based, there are no safeguards against a model that intentionally or unintentionally provides false information or enables users to perform illegal actions. A model can even be constructed so that these attributes appear only in models derived from the foundation model.
Each of these problems significantly undermines a model's trustworthiness and utility.
How does openwashing threaten me and my company?
So why does all of this matter? After all, proprietary models carry many of these same risks. That's true, but openwashed models add consequences all their own.
Legal and Compliance Risks
Openwashed models introduce significant legal and compliance risks to any company that uses them for AI work. For example, some model producers take datasets or code that isn't open source and re-release it under an open-source license. At that point, you're using a model that's legally questionable (at a minimum, violating license terms), even if it never outputs copyrighted information.
Then there are the compliance issues. Using openwashed models can violate industry-specific regulations, leading to fines and damaging the company's standing with regulatory bodies. What's more, AI models that have been trained on sensitive data could lead to inadvertent data exposure. This is particularly problematic in fields like healthcare, where patient data confidentiality is subject to strict regulatory standards such as HIPAA, among many others.
Reputational Damage
Without understanding how a model was built and trained, you really don't have a good handle on what it's going to output, potentially leading to significant reputational and public relations embarrassments (or worse). (Think Microsoft Tay, but without being able to control the outcome.)
Just the discovery that your company is using a model it's not legally entitled to use -- even if no actual harm has taken place -- can directly damage your company's reputation, eroding trust among customers and partners. No serious commercial enterprise would intentionally rely on stolen IP or count on getting away with plagiarism, yet that is exactly the risk here -- and it's not a farfetched one. Rebuilding trust is expensive, consuming significant time and resources. Ensuring honesty in model sourcing and usage can mitigate these risks, but the financial implications still loom large.
Financial Implications
It may seem counterintuitive that a supposedly free model can end up costing you money. And yet, aside from the obvious issue of potential litigation, there are real operational issues to consider. When implementing an LLM for a customer, there are tasks that are specific to that model; if it turns out the model isn't appropriate, not only is the money spent implementing it wasted, you must also absorb the unplanned switching costs of starting over with something else.
How does openwashing threaten my customers?
Your use of openwashed models can impact your customers in a number of different ways.
Some of those risks are the same as the ones to your own company. For example, a model based on copyrighted data can output that copyrighted material to users who aren't entitled to it, which puts your customers at risk. But there are also additional risks that cascade down to the customers who use the tool you built with the model.
Security Vulnerabilities
Openwashing can significantly undermine customer security. When models lack thorough security audits due to false openness claims, hidden vulnerabilities may remain unaddressed. These vulnerabilities can be exploited by malicious actors, compromising customer data and privacy. Is the model giving out misinformation? Without a truly open model, there's no way to know.
Bias and Ethical Concerns
Models that are not transparent may incorporate biases that lead to unfair or discriminatory outcomes against certain customer groups. These biases can erode customer trust and lead to ethical dilemmas. Your customers may even pass bad decisions on to their own customers, compounding the problem: their customers inherit all of the issues you'd be dealing with, plus their own.
Data Privacy Issues
Data privacy is a critical concern for customers when interacting with AI models. Openwashed models might use or share customer data in ways that are not transparent, increasing the risk of breaches and misuse. Is the model sending submitted data to a foreign country? Is it incorporating that submitted data into other users' queries?
How does openwashing threaten AI itself?
It may sound dramatic to talk about openwashing being a threat to AI itself, but it's not. Remember Irving and his stories about the early days of open source, when proprietary companies were trying to scare people out of using open source software? The same thing is happening now, as various countries and legal jurisdictions try to restrict AI and how it's allowed to function.
Losing Access to Open Models
If AI that is truly open falls by the wayside, we will lose the advantages that it provides, such as the ability for companies to build on each other's work. In addition, small companies that can't afford to build their own models will be priced out of the market. Add to that the companies that will be forced away from using AI at all because of prohibitively high usage fees of the proprietary models, and the entire AI industry slows and potentially comes to a standstill. In economic terms, that means privatizing the gains and socializing the losses.
Erosion of Trust in AI
Another risk: perhaps we won't lose access to open models, but because it's difficult or impossible to find any that are truly open, users will stop trusting any models at all. When AI models are presented as open but fail to meet genuine openness standards, it fosters skepticism about the authenticity and reliability of open AI initiatives across the board.
Stifling Innovation
The most obvious issue? Models that are openwashed rather than truly open do real harm to the collaborative spirit intended to drive AI innovation. When models are not genuinely open, researchers and developers are restricted in their ability to examine, modify, and improve upon existing work. This limitation both slows the pace of innovation and restricts the diversity of ideas and solutions that can emerge.
How can I avoid openwashing?
The short and obvious answer to avoiding openwashed models is to make sure that the models you use are truly open. But that's easier said than done. As I mentioned earlier, a model's openness is often misrepresented unintentionally.
The Linux Foundation AI & Data Generative AI Commons has addressed this challenge head-on by standardizing the notion of "open" when it comes to models. The Model Openness Framework specifies three levels of openness a model can reach:
- Class III (Open Model): the model itself -- the architecture and the weights -- is released under appropriate open licenses, along with key documentation such as a technical report.
- Class II (Open Tooling): adds the complete code used to train, evaluate, and run the model.
- Class I (Open Science): adds the training datasets and a full research paper, making the work reproducible end to end.
By creating a standard definition of what an "open" model is, the Model Openness Framework gives model producers a uniform way to ensure that they're doing the right thing.
To help determine which level a model belongs to, the Generative AI Commons has also produced the Model Openness Tool, a diagnostic questionnaire that enables producers to characterize the status of the various pieces of their model and get back an openness rating.
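To make the idea concrete, here's a toy self-assessment in the spirit of the Model Openness Tool. To be clear, this is not the actual MOT logic; the component groupings and class requirements below are simplified assumptions for illustration only:

```python
# A toy openness self-assessment, loosely inspired by the Model Openness
# Tool. The component lists and class requirements below are simplified
# assumptions, NOT the real MOT criteria.

# Which components the producer has released under an acceptable open license.
released = {
    "model_architecture": True,
    "model_weights": True,
    "technical_report": True,
    "training_code": True,
    "evaluation_code": False,
    "training_dataset": False,
    "research_paper": False,
}

# Hypothetical requirements, ordered from least to most demanding.
CLASS_REQUIREMENTS = [
    ("Class III (Open Model)",
     ["model_architecture", "model_weights", "technical_report"]),
    ("Class II (Open Tooling)",
     ["model_architecture", "model_weights", "technical_report",
      "training_code", "evaluation_code"]),
    ("Class I (Open Science)",
     ["model_architecture", "model_weights", "technical_report",
      "training_code", "evaluation_code", "training_dataset", "research_paper"]),
]

def openness_class(released):
    """Return the most demanding class whose requirements are all met."""
    best = "Not open"
    for cls, required in CLASS_REQUIREMENTS:
        if all(released.get(component, False) for component in required):
            best = cls  # later (more demanding) classes overwrite earlier ones
    return best

print(openness_class(released))  # -> Class III (Open Model)
```

Even in this toy version, the value is obvious: the producer has to account for every component explicitly, so a missing dataset or evaluation script can't quietly hide behind an "open" label.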
One thing to keep in mind, however: at least for now, the Model Openness Tool is a self-reporting tool, so once you've found a model whose rating makes you comfortable, do your own due diligence to verify those pieces.
Start with three simple checks: look at the dataset's lineage, verify the licensing, and make sure there's comprehensive documentation.
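You can even script the most basic version of these checks. Here's a minimal sketch, assuming the model is distributed as a local directory; the file names are common conventions (a Hugging Face-style layout, for example), not guarantees:

```python
# A minimal consumer-side due-diligence sketch for a model distributed as
# a local directory. The file names checked here are common conventions,
# not guarantees, so treat the results as a starting point, not a verdict.
from pathlib import Path

def basic_openness_checks(repo_dir):
    repo = Path(repo_dir)
    return {
        # Lineage: is the training data documented at all?
        "dataset_lineage": any((repo / name).exists()
                               for name in ("DATA_CARD.md", "datasheet.md")),
        # Licensing: is there an explicit license file to read?
        "license_present": any((repo / name).exists()
                               for name in ("LICENSE", "LICENSE.txt", "LICENSE.md")),
        # Documentation: is there a model card or README describing the model?
        "documentation": any((repo / name).exists()
                             for name in ("README.md", "MODEL_CARD.md")),
    }

for check, passed in basic_openness_checks("./some-model").items():
    print(f"{check}: {'found' if passed else 'MISSING -- investigate manually'}")
```

Passing these checks proves nothing on its own -- a LICENSE file can still contain restrictive clauses -- but a failure tells you immediately where to start digging.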
But don't stop there. There's a virtuous cycle of innovation, and it relies on you to engage with the community. One of the hallmarks of genuinely open projects is an active community of contributors and users. Engage with these communities through forums and discussions, and by reviewing shared modifications and developments. Community engagement can provide insights into the reliability and ethical use of the model, and into whether its open nature is substantiated by robust user and developer activity. By taking these steps, you can safeguard your interests and contribute to a more trustworthy AI ecosystem.