AI risk management roadmap: mastering intellectual property agreements when building AI products
How to prevent data leakage and be clear about who owns what.
Today’s post is a collaboration with Joel Lehrer at the law firm Goodwin Procter LLP. The information in this article is strictly for educational purposes, and does not constitute legal advice.
Uncertainty over intellectual property (IP) ownership and content usage presents a roadblock to the enterprise-wide adoption of Artificial Intelligence (AI). Much of the public conversation has centered on copyrights: whether the use of content to train Large Language Models (LLMs) constitutes “fair use,” and whether the outputs of LLMs themselves carry copyright protection. Litigation such as the recently filed case between the New York Times and OpenAI may provide some guidance on these questions, but is likely to take some time to work itself out.
At the same time, major technology companies offer indemnification in some customer agreements in an attempt to provide some comfort to their users. These provisions have not yet been aggressively tested, and it remains to be seen what impact they will have.
In the midst of this uncertainty, businesses are being pressured to adopt new generative AI tools and need to make decisions on how best to move forward while limiting risks. In this article, we propose some concrete ways companies can:
Prevent unintended exposure of confidential data when developing AI models.
Reduce uncertainty about the IP ownership of the products of AI training.
We provide a skeleton of key terms and concepts, shown in italics, that can be included in your contract.
Set the scope
Specifically, we will focus on a contractual agreement (“Agreement”):
with two Parties (individually “Party”);
who are not, and could not conceivably become, competitors;
with roughly equivalent bargaining power;
where the IP described in the contract has uncontested ownership;
where an AI model will be trained (and owned) by one Party using the other’s data; and
where the model will be available to users beyond those who provided the training data.
Some examples of this would be:
A healthcare company uses patient data to build an AI model to diagnose cancer for its general population.
A manufacturing company provides machine telemetry to an Internet of Things (IoT) software company to train a predictive maintenance model with the data.
A publishing company - to which its writers have licensed or transferred the copyrights in their books - further licenses those materials to an AI company to build a chatbot that mimics the style of certain of its authors.
As Walter and Joel explained in a previous article, properly drafted non-disclosure agreements can help protect against unintended training of AI models on confidential information - a position most organizations will generally want as a default. But as the above examples illustrate, there may be scenarios where it is in the interests of both parties to train a model on one party’s data. Using the above examples:
Patients get better treatment using a model trained on a large, diverse corpus of data.
Manufacturers can implement more accurate maintenance schedules when an algorithm is trained on their data.
Customer experience with the chatbot is enhanced because of the richer content on which it was trained.
In the first two cases, there may very well be a business need to make the resulting models, trained on confidential information, available to third parties not under an obligation of confidentiality, e.g., to other customers.
And in the case of the publishing company building a chatbot, it would be critical to discern what data could be used to train a generative AI model, such as the text of its books, and what data could not, such as projected sales of the books and the chatbot.
Boilerplate confidentiality and intellectual property assignment clauses often fall short here. More detail is needed.
Define key terms
Distinguish between predictive and generative AI
There are many different terms used to describe categories and modalities of AI and machine learning, such as:
Supervised learning
Unsupervised learning
Natural language processing
Exact definitions for each of these terms may vary slightly based on the context in which they are being used. For contractual assignment of or licenses to intellectual property, however, having clear definitions is crucial. Toward that end, we propose identifying two types of AI systems: predictive and generative.
Predictive Artificial Intelligence (“Predictive AI”) means a system, model, or algorithm that generates as its output only numbers, categorical labels, or combinations thereof. This includes but is not limited to techniques such as:
Linear regression
Logistic regression
k-means clustering
Generative Artificial Intelligence (“Generative AI”) means any system, model, or algorithm that generates as its output anything more than numbers, categorical labels, or a combination thereof. This includes but is not limited to:
Large language models
Image generation models
There is some overlap here, and that is by design: a system that does Predictive AI work but also produces richer outputs is classified as Generative AI.
The main reason for having these definitions is that Predictive Artificial Intelligence is unlikely to produce output that anyone would consider to be - by itself - sensitive or confidential. Generative Artificial Intelligence, on the other hand, almost certainly could.
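To make this boundary concrete, here is a minimal Python sketch of the classification rule. The OutputType categories and the classify_system helper are our own illustrative names, not terms of art from any contract or library:

```python
from enum import Enum

class OutputType(Enum):
    NUMBER = "number"            # e.g., a regression score
    CATEGORICAL_LABEL = "label"  # e.g., a class prediction
    TEXT = "text"                # free-form text, as from an LLM
    IMAGE = "image"              # generated images

# Output types that keep a system within the Predictive AI definition.
PREDICTIVE_ONLY = {OutputType.NUMBER, OutputType.CATEGORICAL_LABEL}

def classify_system(output_types: set[OutputType]) -> str:
    """Apply the contractual rule: a system is Predictive AI only if every
    output is a number or categorical label; anything more makes it
    Generative AI (the overlap rule described above)."""
    return "Predictive AI" if output_types <= PREDICTIVE_ONLY else "Generative AI"

print(classify_system({OutputType.CATEGORICAL_LABEL}))                   # Predictive AI
print(classify_system({OutputType.CATEGORICAL_LABEL, OutputType.TEXT}))  # Generative AI
```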
Defining training data
“Training Data” means any information that either Party uses to develop the Predictive AI or Generative AI contemplated by this Agreement.
This is admittedly a very broad term and could include things like:
the Python code used to build the model;
infrastructure-as-code for running it; or
data used to generate model weights.
This breadth is a feature, not a bug, though; we’ll apply the term narrowly later on.
Confidential and Non-Confidential Proprietary Information
Many agreements include a definition like the following:
“Confidential Information” is all information provided by one party to another, whether of a technical, business, financial, or any other nature, disclosed in any manner, whether verbally, electronically, visually, or in a written or other tangible form, which is either identified or designated as confidential or proprietary or which should be reasonably understood to be confidential or proprietary in nature.
This is broad, but will work for our purposes. A key point here is that it may be convenient to create another category of information that:
is not confidential;
belongs to only one Party; and
may be used to train AI models, as permitted and regulated by the contract.
We’ll call this “Non-Confidential Proprietary Information,” and define it below.
“Non-Confidential Proprietary Information” is anything that is a(n):
discovery, development, concept, design, idea, know-how, improvement, invention, or original work of authorship; and
in which a Party owns all right, title, and interest, including all patent, copyright, trademark, or other intellectual property rights therein; and
which is not Confidential Information.
Prevent data leakage by specifying training techniques and data sources
We can now accomplish our first key objective: laying out how and on what data the Parties may train AI models while preserving the confidentiality of certain data. Here are some example scenarios:
A Party may develop Predictive AI where its Training Data includes the other Party’s Confidential Information.
A Party may develop Generative AI where its Training Data includes the other Party’s Non-Confidential Proprietary Information.
Neither Party shall develop Generative AI where its Training Data includes the other Party’s Confidential Information.
Using our three example situations, the party training the model may be able to accomplish the following goals:
Build a prediction algorithm that provides the likelihood that a given X-ray depicts a tumor that will become cancerous, using anonymized images from real patients. The training entity could not, however, build an image generation model that creates pictures resembling X-rays with (pre-)cancerous tumors based on specific patient images, as one might be able to identify the particular patient to whom a given X-ray belongs. This is an appropriate restriction that still permits the system to provide its primary benefit.
Build a classification algorithm that labels a machine part as ready for replacement based on how long it has been in service and its usage. In fact, many businesses may not be concerned if such a system (without the training data) were used outside of their control. The training party would not, however, be able to train or fine-tune an LLM on the entire corpus of telemetry outputs from a set of machines to build a chatbot, because attacks that coax generative models into revealing memorized training data verbatim are well documented and hard to prevent entirely.
Build a chatbot using the licensed works of the publisher’s authors, because such works would be defined as Non-Confidential Proprietary Information. Creating an LLM using the email communications from negotiations between the AI company and the publisher, however, would be prohibited.
The broadness of the definition of “Training Data” means that any inclusion of a Party’s Confidential Information or Non-Confidential Proprietary Information in building a model would, by design, trigger the above restrictions. Using unambiguously defined terms removes any lack of clarity regarding AI systems that are trained “just a little bit” on the other Party’s data.
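To show how these unambiguous definitions can translate into day-to-day enforcement, here is a minimal Python sketch of a pre-training compliance gate implementing the three example clauses above. All names here (ModelKind, DataClass, training_permitted) are hypothetical and for illustration only:

```python
from enum import Enum

class ModelKind(Enum):
    PREDICTIVE = "Predictive AI"
    GENERATIVE = "Generative AI"

class DataClass(Enum):
    CONFIDENTIAL = "Confidential Information"
    NON_CONFIDENTIAL_PROPRIETARY = "Non-Confidential Proprietary Information"

def training_permitted(model: ModelKind, counterparty_data: set[DataClass]) -> bool:
    """Check a proposed training job against the three example clauses:
    Predictive AI may train on the other Party's Confidential Information;
    Generative AI may train on Non-Confidential Proprietary Information
    but never on Confidential Information."""
    if model is ModelKind.GENERATIVE and DataClass.CONFIDENTIAL in counterparty_data:
        return False
    return True

# The publisher's chatbot: Generative AI trained only on licensed works -- allowed.
assert training_permitted(ModelKind.GENERATIVE, {DataClass.NON_CONFIDENTIAL_PROPRIETARY})
# An LLM trained on negotiation emails (Confidential Information) -- blocked.
assert not training_permitted(ModelKind.GENERATIVE, {DataClass.CONFIDENTIAL})
```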
Assign data and AI model ownership
A contract should also clearly spell out who owns the products of the AI training. To simplify matters, we will assume sole ownership of the finished models - and an equal trade between two Parties, A and B.
“Owner” is the Party having sole right, title, and interest as well as all applicable patent, copyright, trade secret, trademark, and other intellectual property rights.
The Parties agree that:
Party A shall be the sole and exclusive Owner of all Predictive AI developed pursuant to this Agreement where the Predictive AI Training Data contains Party B’s Confidential Information; and
Party B shall be the sole and exclusive Owner of all Generative AI developed pursuant to this Agreement where the Generative AI Training Data contains Party A’s Non-Confidential Proprietary Information.
Each Party shall remain Owner of its Confidential Information and Non-Confidential Proprietary Information.
Additional considerations for IP assignment
No contract - or even fragments thereof - will ever be a silver bullet applicable to all situations. Business situations and individual Party objectives are likely to vary widely. Thus, customizing the language to meet your (client’s) use case is always advisable. Below are some additional considerations toward that end.
Allowing a vendor to train on your confidential data may help competitors, but will also help you
Companies building AI products need real-world data to make them effective. And these companies need to sell their products to more than just one customer to survive.
Thus, it’s possible - even likely - that a vendor building a model using your Confidential Information might make the model (without the underlying data) available to a competitor. Doing so is likely to help that competitor.
While this might seem like a show-stopper issue at first, consider other things that companies regularly do to improve vendor products (which can also help competitors):
Participating in case studies showing how to use the product in a certain industry.
Providing feedback on new features.
Reporting bugs.
AI training is a similar, albeit newer, development.
The reason companies do all of the above, however, is that they expect the reward from improved functionality to outweigh the risk of helping competitors. And with AI training, getting the model trained on your data might mean the product is more customized to your needs. So you might as well move first.
Clearly communicate your IP assignment provisions, especially in a B2C context
In early 2023, Zoom quietly made some changes to its terms and conditions with respect to its right to train AI models on customer content. Major backlash ensued later that year when a security researcher highlighted the changes. Zoom quickly backpedaled and subsequently changed its terms several times.
Although Zoom originally made expansive claims regarding its ownership of, and ability to train on, customer data, it does not appear the company was exercising those claimed rights in full; rather, it was doing more narrow AI training.
From a public relations perspective, though, this more limited scope didn’t end up mattering. Reputational damage occurred rapidly and was very difficult to undo.
The lesson here is that firms seeking to train AI on customer content should be very explicit in their messaging, anticipating likely questions and concerns. This is especially true for business-to-consumer (B2C) contexts. Here, customers are unlikely to do detailed contract review but will instead rely on influencers to alert them to practices outside industry or societal norms.
How you assign ownership can reasonably shift based on the relationship between the parties
While the sample language we provide is most appropriate for near-peers in non-competitive industries, different terms might be more appropriate in different cases.
Firms with potential competitive overlap may want to prohibit AI training of any kind on Confidential Information, even using Predictive AI.
Companies with superior bargaining power may simply opt customers in to AI training on all data they submit, even Confidential Information.
As the Zoom case demonstrated, however, being clear about your policies without burying them in legalese is vital to effective public relations.
Consider specifying the exact types of algorithms to be used and data to be trained on
Instead of separating models into Predictive AI and Generative AI and splitting data into Confidential Information and Non-Confidential Proprietary Information, it might make sense to be even more explicit.
For example, if you are concerned about a counterparty using your data for k-means clustering for some reason but are okay with them doing linear regression on it, make that clear.
Similarly, in the case of machine telemetry you might specify that vibration data can be trained on but that power usage cannot be.
Getting this detailed may complicate negotiations. But it may be appropriate when there is a sufficiently large financial reward with a countervailing risk of misuse of the final AI model(s).
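As a rough sketch of what this extra specificity could look like in practice, consider a contract-derived policy checked before any training job runs. The field and algorithm names below are assumptions drawn from the machine-telemetry example, not from any real agreement:

```python
# Hypothetical, contract-derived policy for the machine-telemetry example:
# which telemetry fields may become Training Data, and with which algorithms.
TRAINING_POLICY = {
    "allowed_fields": {"vibration", "temperature", "hours_in_service"},
    "prohibited_fields": {"power_usage"},
    "allowed_algorithms": {"linear_regression", "logistic_regression"},
    "prohibited_algorithms": {"k_means"},
}

def check_training_job(fields: set[str], algorithm: str) -> list[str]:
    """Return a list of contract violations for a proposed training job."""
    violations = []
    bad_fields = fields & TRAINING_POLICY["prohibited_fields"]
    if bad_fields:
        violations.append(f"prohibited fields: {sorted(bad_fields)}")
    unaddressed = (fields - TRAINING_POLICY["allowed_fields"]
                   - TRAINING_POLICY["prohibited_fields"])
    if unaddressed:
        violations.append(f"fields not addressed by the contract: {sorted(unaddressed)}")
    if algorithm in TRAINING_POLICY["prohibited_algorithms"]:
        violations.append(f"prohibited algorithm: {algorithm}")
    return violations

print(check_training_job({"vibration", "power_usage"}, "k_means"))
# ["prohibited fields: ['power_usage']", 'prohibited algorithm: k_means']
```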
Exclude third-party intellectual property and get (or affirmatively choose not to get) warranties or indemnities about infringement
We are intentionally not wading into the discussion about AI training on third-party IP, especially copyrighted material. With that said, it is important to appropriately manage the risk of its inclusion in your data supply chain. Options include:
Getting warranties from a counterparty that it owns all of the IP in question.
Having the counterparty indemnify you against any third-party claims, as organizations like OpenAI, Microsoft, Google and others have done in certain cases.
Accepting the risk of litigation on the grounds that your AI model infringes third-party material.
Some combination of the above.
Understand the regulatory implications of training on personal data
If it is possible to de-anonymize individual people from the finished model(s) (whether Predictive AI or Generative AI), additional regulatory protections likely apply. Some key types of protected data include:
Protected health information (PHI), as defined by the U.S. Health Insurance Portability and Accountability Act (HIPAA).
Personal data, as defined by the European Union (EU) General Data Protection Regulation (GDPR).
Personally identifiable information (PII), as defined by various U.S. state laws and federal regulations.
Ensure you have the right safeguards in place - contractually and from a technical perspective - if your resulting model meets the criteria for these or other types of protected data.
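On the technical side, one (partial) safeguard is screening records for obvious identifiers before they become Training Data. Below is a deliberately naive Python sketch; the two regex patterns are illustrative assumptions and are nowhere near sufficient on their own for HIPAA, GDPR, or state-law compliance:

```python
import re

# Deliberately naive patterns for illustration only; a real compliance
# program needs vetted tooling plus legal review, not two regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def flag_records_with_pii(records: list[str]) -> list[tuple[int, str]]:
    """Flag records that appear to contain identifiers before they
    are allowed to become Training Data."""
    flagged = []
    for i, record in enumerate(records):
        for kind, pattern in PII_PATTERNS.items():
            if pattern.search(record):
                flagged.append((i, kind))
    return flagged

print(flag_records_with_pii([
    "Patient reported symptom onset in March.",
    "Follow up with jane.doe@example.com",
]))  # [(1, 'email')]
```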
Conclusion: moving the state of the art forward, understanding that gaps remain
In providing this framework, we understand it is only applicable to a discrete (but realistic) set of scenarios, primarily in the business-to-business context. Many weighty questions remain undecided, such as whether:
training LLMs on copyrighted material is fair use;
AI-generated drugs can be patented; and
a chatbot can dilute a trademark.
We are steering clear of these for now, with the goal of providing clear and actionable guidance to help organizations protect their sensitive data from unwitting disclosure and ensure their AI products have clearly delineated ownership.
If this all seems complex and you are worried about the cybersecurity of your information when using AI, StackAware can help.
And if you need expert legal counsel, Goodwin is ready to assist.