Transparency toolkit: 3 principles for secure and fair AI training with customer data
Protecting privacy while fostering innovation.
AI training on customer data is a sensitive subject.
Zoom’s turbulent experience last summer and the public response to questions I asked about DocuSign’s AI training processes make clear that people are paying close attention to this topic.
While most people would probably prefer that AI not be trained on their data - all other things being equal - there is no AI without training data.
This creates a paradox which I will attempt to solve.
While I have made some proposals about technical architectures for secure AI training and processing, they don’t address what a business model should look like.
Deploy Securely is focused on actionable, real-world advice, so that’s what I provide here. Below I’ll lay out some principles for how to train AI on customer data in a win/win manner. Note: this will be specific to non-public information because AI training on public data opens up a whole different can of worms that I don’t want to tackle here.
AI training on customer data must be transparent
Clearly stating how you are training your AI models on information provided in confidence is not negotiable in 2024.
Stealthily tucking changes into your terms and conditions, or leveraging existing language like “company may create derivative works using customer data” to justify AI training, just won’t cut it.1
Although I don’t give legal advice, an attorney and I did write a post discussing how to structure these agreements. I also went in depth on the best ways to communicate AI training and processing here.
And if you are relying on non-contractual legal authorities for your training, like the research provisions of the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, clearly say so.2
Make AI training opt-in where feasible
Practices vary widely when it comes to customer consent:
AI training on your Gmail messages for spam filtering is mandatory
Training when using ChatGPT’s free version is opt-out
For Claude, it’s (mostly) opt-in
There is no “right” way to structure this, and I don’t agree with some privacy absolutists who suggest you have the right both to a) use a company’s product and b) dictate exactly how it builds that product and leverages (or doesn’t leverage) your data.
With that said, there is almost always going to be reputational risk associated with forcing - or nudging - people into AI training.
The “cleanest” approach is thus to encourage opt-ins with financial incentives (a rough sketch follows this list). These can take the form of:
Discounts on products you sell
Gift cards or actual gifts
Cold, hard cash
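To make the opt-in-plus-incentive idea concrete, here is a minimal sketch in Python of what a consent record tied to a discount could look like. The field names, the grant() helper, and the 10% discount are all hypothetical illustrations, not a description of any real product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical incentive: a flat 10% discount while training consent is active.
OPT_IN_DISCOUNT = 0.10

@dataclass
class TrainingConsent:
    customer_id: str
    opted_in: bool = False                 # opt-in model: the default is "no training"
    recorded_at: datetime | None = None

    def grant(self) -> None:
        """Record an explicit, timestamped opt-in (never inferred from silence)."""
        self.opted_in = True
        self.recorded_at = datetime.now(timezone.utc)

def monthly_price(list_price: float, consent: TrainingConsent) -> float:
    """Apply the opt-in discount only when consent has been explicitly granted."""
    return list_price * (1 - OPT_IN_DISCOUNT) if consent.opted_in else list_price

consent = TrainingConsent(customer_id="cust-42")
print(monthly_price(100.0, consent))  # 100.0: no consent, no training, no discount
consent.grant()
print(monthly_price(100.0, consent))  # 90.0: explicit opt-in earns the discount
```

The important design choice is that the default is “no training”: consent only exists as an explicit, timestamped action taken by the customer.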
Again, there are folks who oppose such a system and want to restrict this type of commerce between consenting parties. But this is just them forcing their concept of “privacy” onto other people who don’t necessarily view it the same way.
For high-risk, high-reward use cases, customers should approve the results of de-identification processes before their data is used for AI training
In some situations it is unnecessary to de-identify data prior to AI training, like with temperature readings.
In other situations, de-identification will be relatively easy. For example, you could simply remove all columns linked to an individual person in a SQL database containing supermarket sales records.3
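As a minimal sketch of that structured case (using pandas and made-up column names rather than raw SQL), de-identification really can be a one-liner once you know which columns identify a person:

```python
import pandas as pd

# Hypothetical supermarket sales table; the column names are illustrative only.
sales = pd.DataFrame({
    "customer_name": ["A. Jones", "B. Smith"],
    "loyalty_card_id": ["LC-001", "LC-002"],
    "item": ["milk", "bread"],
    "price": [3.49, 2.99],
    "purchased_at": ["2024-01-05", "2024-01-06"],
})

# Columns that link a row back to an individual person.
IDENTIFIER_COLUMNS = ["customer_name", "loyalty_card_id"]

# Drop them before the data ever reaches a training pipeline.
deidentified = sales.drop(columns=IDENTIFIER_COLUMNS)
print(deidentified)
```

The SQL equivalent is just as simple: select only the non-identifying columns into the training extract. The hard part, as the next case shows, is when identifying information is not confined to predictable columns.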
But in a third type of situation, de-identification will be both necessary and difficult.4 Generative AI training on contracts is one such example. After I asked publicly about DocuSign’s practices in this respect, they responded:
The DocuSign de-identification process removes identifiers such as names, addresses, bank account information, and social security numbers (SSNs) from data used for model training.
The DocuSign anonymization process transforms and/or modifies a customer’s data in such a way that it can no longer be used to identify or trace back to an individual organization.
I still have some open questions, however (which I reiterated to one of their engineers), specifically whether this is done:
Manually?
With a rules-based approach?
Using another AI model or process?
I suspect #2 is the answer, but all of these methods are likely to miss unstructured sensitive data (see the sketch after this list), such as:
Financial projections for a company undergoing a merger or acquisition
Descriptions of trade secrets sold from one party to another
Personal information embedded into images rather than text
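To illustrate why, here is a minimal sketch of a rules-based redactor. This is my own illustration, not DocuSign’s actual process, and the patterns are deliberately simplistic: they catch well-formatted identifiers like SSNs and email addresses but say nothing about free-text business secrets.

```python
import re

# Illustrative patterns for well-structured identifiers (deliberately not exhaustive).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace matches of known identifier patterns with placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text

clause = (
    "Contact John Doe at john.doe@example.com (SSN 123-45-6789). "
    "Seller projects $40M in FY2025 revenue contingent on the merger closing."
)
print(redact(clause))
# The SSN and email are caught, but the revenue projection passes through untouched.
```

A name like “John Doe” or a merger-related financial projection sails straight through, because nothing about its format marks it as sensitive.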
With that said, automating the drafting of legal documents can definitely create a lot of value by saving on lawyers’ fees.
So this is a high-risk, high-reward situation.
I think the best way forward here is to require customers to confirm the results of such de-identification before training AI on the redacted material.5 Companies can incentivize this by compensating customers for each page (or other unit of data) they approve, or correct when it was not properly de-identified.
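As a rough sketch of what such a review-and-compensation loop might track (the rates, field names, and page IDs are all hypothetical), each page gets an explicit customer verdict, corrections are captured for reuse, and payouts are tallied per unit reviewed:

```python
from dataclasses import dataclass, field

# Hypothetical payout rates per page reviewed.
APPROVAL_RATE = 0.10    # customer confirms the page was properly de-identified
CORRECTION_RATE = 0.50  # customer flags and fixes residual sensitive content

@dataclass
class PageReview:
    page_id: str
    approved: bool
    corrections: list[str] = field(default_factory=list)  # spans the customer redacted

def payout(reviews: list[PageReview]) -> float:
    """Total compensation owed for a batch of reviewed pages."""
    return sum(CORRECTION_RATE if r.corrections else APPROVAL_RATE for r in reviews)

batch = [
    PageReview("contract-7-p1", approved=True),
    PageReview("contract-7-p2", approved=False, corrections=["Acme FY2025 revenue forecast"]),
]
print(f"Owed this customer: ${payout(batch):.2f}")  # $0.60
```

Those corrected spans are also exactly the labeled examples an AI-based de-identifier needs in order to improve, which is where the feedback loop described next comes from.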
And if the company is using AI for de-identification, that algorithm gets better every time.
This will form a virtuous cycle over time, where
the de-identification process will continuously improve
customers will get rewarded for their data and efforts
the company can sell its AI product with confidence
AI isn’t going away, so companies need to figure out how to leverage it securely and responsibly
The powerful economic forces driving aggressive roll-outs of AI-powered products and features are not going to abate. Getting a handle on the relevant issues related to:
Cybersecurity
Compliance
Privacy
is more important than ever.
Need help navigating them?
The Federal Trade Commission (FTC) recently warned such stealthy changes might constitute unlawfully unfair or deceptive practices.
Thanks to Casey Douglas for her feedback on a previous article regarding HIPAA and anonymization, which I have incorporated here.
The FTC also recently initiated action against a company based on its allegedly misleading claims about data anonymization, so clear and accurate statements are key here.