Note: As always, I am not an attorney and this is not legal advice.
“Do not infringe on third-party intellectual property rights.”
Quite a few artificial intelligence (AI) policies I have read say something like this. While the intent is admirable, these statements provide no actionable guidance to employees:
For patents and trademarks, expecting an employee to conduct a global search for similar works (and to know what might be infringing) is unrealistic.
For copyright, there is no database of what is protected to begin with.
Acceptable practices for training AI models on the above are also unclear.
It’s understandable that companies want to be conservative in the face of legal ambiguity. And there is a lot of it when it comes to AI, especially the generative variety. But broad statements impossible for the average employee to act on lead to either:
Apathy and slowed innovation; or
Freelancing, where people just make up (and break) rules as they go.
Do AI governance frameworks and standards help here?
Not really.
They all give only high-level guidance. Here is what they say about intellectual property (IP) considerations:
National Institute of Standards and Technology Artificial Intelligence (AI) Risk Management Framework (RMF)
GOVERN 6.1: “Policies and procedures are in place that address AI risks associated with third-party entities, including risks of infringement of a third party’s intellectual property or other rights.”
MAP 4.1: “Approaches for mapping AI technology and legal risks of its components – including the use of third-party data or software – are in place, followed, and documented, as are risks of infringement of a third-party’s intellectual property or other rights.”
ISO/IEC 42001:2023
Clause 4.1: “External and internal issues to be addressed under this clause can vary according to the organization’s roles and jurisdiction and their impact on its ability to achieve the intended outcome(s) of its AI management system. These can include, but are not limited…applicable legal requirements.”
Annex B.7.3: “The organization can need different categories of data from different sources depending on the scope and use of their AI systems. Details for data acquisition can include…data rights (e.g. PII, copyright)”
HITRUST AI Security Certification
Baseline Unique ID 19.06cAISecOrganizational.1: “The organization performs an assessment to identify and evaluate its compliance with constraints on the data used for AI efforts (i.e., data used for training, validating, tuning, and augmenting the prompts of AI systems via RAG), including those related to applicable laws, applicable regulatory requirements, applicable contractual obligations, the organization’s self-imposed data governance requirements, and copyrights or commercial interests.”
So how should I think about intellectual property risk with AI…like really?
I came up with seven risk thresholds companies can use to set guidelines for AI-related intellectual property risk. Leaders will need to understand and accept a certain degree of it until there is more legal clarity (probably years from now).
If you attempt to avoid it, you’ll likely take on even greater competitive risk. There is also evidence (discussed below) you’ll encourage shadow AI use by employees just trying to get their jobs done. And there is no way to apply any guardrails in this situation, creating potentially even greater IP infringement risk.
How the approach works:
I recommend using a high water mark approach to risk across the entire company. While business units should use the lowest risk approach possible that achieves their objectives, do not allow any to exceed the maximum risk level the company as a whole accepts.
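To make this concrete, here is a minimal sketch of such a check. The level numbers reference the thresholds defined later in this article (a higher number means higher risk), and the enterprise maximum and business unit requests are hypothetical examples, not recommendations.

```python
# Minimal sketch of a "high water mark" check: no business unit may operate above
# the enterprise-wide maximum risk level. Level numbers follow the thresholds
# defined later in this article (higher number = higher IP risk). The maximum and
# the business unit requests below are hypothetical examples.
ENTERPRISE_MAX_LEVEL = 2  # e.g. only use tools offering indemnification

business_unit_requests = {
    "marketing": 1,    # wants tools trained only on explicitly licensed data
    "engineering": 3,  # wants tools without indemnification provisions
}

for unit, requested_level in business_unit_requests.items():
    if requested_level <= ENTERPRISE_MAX_LEVEL:
        print(f"{unit}: approved at level {requested_level}")
    else:
        print(f"{unit}: denied - level {requested_level} exceeds the enterprise "
              f"maximum of {ENTERPRISE_MAX_LEVEL}")
```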
On a related note, companies building AI governance programs often feel the urge to separate "internal" (e.g. productivity tools for coding, content generation, etc.) use cases from “external” (e.g. sold as a product) ones. But that is generally a mistake. As with managing other AI risks, I recommend a comprehensive, enterprise-wide approach. That means this framework applies to companies training, selling, buying, and using AI.
I suggest noting acceptable risk levels in an “intellectual property compliance standard” or similar document to which employees can refer.
When I describe “training” of AI models, this includes fine-tuning ones another party has already trained. While there are technical differences, from a security and compliance perspective they are similar.
I don’t focus on restrictions on using AI or content it generates except those from an intellectual property perspective. See this article for a description of separate issues related to contractual restrictions.
Each risk level excludes anything explicitly permitted in the levels above.
Importantly, I’m not recommending any company adopt a certain threshold. Nor am I saying anything about the ethics of any given approach. This is all about managing reputation and litigation risk.
Here are the seven thresholds, in descending order of risk:
Level 5 - Train your own models on pirated content, ignore robots.txt instructions
According to lawsuits filed against OpenAI, Meta, NVIDIA, and Databricks:
Much of the material in the training datasets used by the defendants comes from copyrighted works—including books written by Plaintiffs—that were copied and used for training without consent, without credit, and without compensation. Many of these books likely came from “shadow libraries”, websites that piratically distribute thousands of copyrighted books and publications.
While it’s unclear to what extent such training on pirated content occurs, any company doing it should understand it is a high-risk approach from both a legal and public relations perspective. This is especially true when the authors of the content in question didn’t make it public on the internet themselves.
A Meta employee allegedly thought the same way in an internal email unearthed during the aforementioned lawsuits:
If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues.
Such risk could conceivably be balanced against the performance benefits to be had from training on high-quality material that would be difficult or impossible to get a license to, but tread extremely carefully.
At the same risk level is ignoring instructions in robots.txt files, which many websites use to forbid certain (or all) bots and scrapers from indexing and consuming their content.
Level 4 - Train your own models on public content for which you don’t have an explicit license
This seems to be the bulk of what OpenAI, Microsoft, Google, and others - whom I’ll call the AI hyperscalers - are doing. At this level, you respect robots.txt instructions and only use what’s on the open internet. OpenAI’s official position is that generative AI training on publicly available copyrighted content is legally permitted “fair use.”
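For teams collecting their own training data, operating at this level means actually checking robots.txt before scraping a page. Below is a minimal sketch using Python’s standard library robotparser; the site, URL, and crawler name are hypothetical placeholders.

```python
# Minimal sketch: check a site's robots.txt before collecting a page for training
# data. The domain, target URL, and user-agent string are hypothetical.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

crawler_user_agent = "ExampleTrainingBot"  # hypothetical crawler name
target_url = "https://example.com/articles/some-post"

if rp.can_fetch(crawler_user_agent, target_url):
    print("robots.txt permits crawling this URL; proceed with collection.")
else:
    print("robots.txt disallows this URL; skip it to stay at or below this level.")
```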
A gray(er) area arises when it comes to open-source software (OSS) made publicly available under a variety of different license terms (e.g. MIT, Apache 2.0). A separate lawsuit against GitHub for training its Copilot product on this type of OSS describes the practice as “piracy,” but I consider it to be less risky than level 5 activities because the OSS creators intentionally made the content public.
As history shows, large-scale AI model training at this risk level creates considerable litigation and reputation risk. Realistically, the risk is lower if you aren’t selling or benefitting commercially from the resulting model because you won’t be as visible (or appealing) a target.
Level 3 - Use AI tools or models trained by others that don’t offer intellectual property indemnification
The three AI hyperscalers mentioned above - and a growing number of others - offer indemnification provisions for certain products that give customers added confidence. Essentially, these firms offer to defend you (under certain conditions) against suits from third parties claiming you infringed on their intellectual property by using the hyperscaler’s AI tool.
But there are many cases where such indemnifications are not available, like when using:
Proprietary commercial tools that don’t offer them.
Open source models trained on data to which the trainer doesn’t have an explicit license.
Vendors leveraging AI hyperscaler services who do not extend the indemnification to you.
I’m not aware of any companies ever being sued for merely using the outputs of generative AI tools. Thus, companies may decide the benefits of using the tool outweigh the risks, even without indemnification.
Level 2 - Only use AI tools that give indemnification, follow restrictions closely
It’s unlikely the average firm operating on this level will see litigation on intellectual property infringement grounds. Especially with the explicit threat of bringing expensive legal teams to bear and the indirect nature of any alleged damages, it’s probably not worth it for any company to sue an AI hyperscaler's customer.
To take advantage of these indemnification provisions, users often need to implement controls such as GitHub Copilot’s “block suggestions matching public code” feature. This prevents the tool from returning code suggestions over 150 characters long that match public repositories. The 150-character limit is arbitrary, but it lets GitHub prevent its tool from returning code “too close” to the training data.
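To illustrate the idea only (GitHub has not published its actual implementation), a duplication filter of this kind can be thought of as rejecting any suggestion that shares a long verbatim substring with indexed public code. Everything in this sketch is hypothetical.

```python
# Rough illustration of a "block suggestions matching public code" style filter.
# This is NOT GitHub's actual implementation; the corpus, window size, and
# suggestion below are hypothetical placeholders.
MATCH_WINDOW = 150  # characters, mirroring the threshold described above

def contains_long_public_match(suggestion: str, public_snippets: list[str]) -> bool:
    """Return True if any MATCH_WINDOW-length slice of the suggestion appears
    verbatim in a known public snippet."""
    if len(suggestion) < MATCH_WINDOW:
        return False
    for start in range(len(suggestion) - MATCH_WINDOW + 1):
        window = suggestion[start:start + MATCH_WINDOW]
        if any(window in snippet for snippet in public_snippets):
            return True
    return False

# Usage sketch: suppress the suggestion instead of showing it to the developer.
public_corpus = ["...indexed public repository code..."]  # placeholder
candidate = "...model-generated code suggestion..."       # placeholder
if contains_long_public_match(candidate, public_corpus):
    candidate = ""  # drop the suggestion entirely
```

A production filter would rely on hashing or indexing rather than brute-force substring search, but the policy effect is the same: suggestions that closely match public code never reach the developer.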
All things being equal, these types of features reduce the risk of a third party noticing - and taking action on - any allegedly infringing material. Be careful, though, because some indemnification provisions - such as those for one major AI video tool - only protect against patent-related claims. Third-party claims on patent grounds seem much less likely than on copyright grounds (especially for AI-generated video), so read carefully.
Level 1 - Use only AI trained on data to which you (or the trainer) have an explicit license
This is a relatively conservative approach. Adobe, for example, claims that “[w]e only train Adobe Firefly on content where we have permission or rights to do so” and “[w]e do not mine content from the web to train Adobe Firefly.”
If every license agreement were ironclad, then conceivably there wouldn’t be any litigation risk at all with this approach. With that said, it wouldn’t address vendors who generate content using AI trained at risk level 2 or higher. They could still conceivably introduce “tainted” content into your supply chain.
But you are unlikely to be first in line to get hit with a suit.
That said, only a minority of generative AI vendors offer these types of guarantees, creating supplier concentration risk if you insist on them. Add to that the innovation roadblock of not being able to leverage “riskier” tools.
Level 0 - “Don’t use AI”
This is basically impossible.
That doesn’t stop security practitioners from (incorrectly) claiming their companies don’t use AI.
Because the biggest risk for companies from an IP standpoint would be generative (as opposed to predictive) AI use, it might be more reasonable to try to forbid that specific type.
With that said, according to one survey, 8% of employees at companies banning ChatGPT admitted to using it anyway! And as I draft this article, Google Docs is suggesting auto-completions powered by generative AI.
Level -1 - “Don’t use anything created by AI”
Even below level 0 (and even more unrealistic) is this position.
If you assume every piece of content ever created by generative AI is tainted because:
No one can own the copyright to it (or most of it); and
It might have been generated by a model infringing on someone else’s intellectual property
you might find yourself here.
Considering:
Even OpenAI cannot detect content created by its own tools;
There is no global database of AI-generated content; and
Watermarks are easily removed
good luck enforcing this policy.
If this is your risk appetite, please disconnect your computer from the internet, shut it down, and back away slowly.
Conclusion
Life - and especially business - is full of risks. While it’s tempting to simply say “don’t take any,” that’s not realistic. Doing so puts employees in a classic double bind where they must follow stifling and vague rules while also succeeding at their jobs.
What you can do, however, is document your risk appetite in a standard and:
Train employees on what to (not) do.
Remind them with in-stride warnings (e.g. browser notifications when they attempt to use unsanctioned tools).
Hold them accountable if they intentionally violate your policies.
Especially with the gray area associated with intellectual property and AI, I hope this guide will give you some ideas to manage the risk.
Need customized help tackling AI risk?
StackAware can help.