OpenAI is under investigation or facing enforcement action from data privacy regulators in multiple jurisdictions.
All of them allege some form of General Data Protection Regulation (GDPR) non-compliance. Because the GDPR is so expansive (and vague), these actions stem from a wide variety of underlying issues, such as:
Data collection and web scraping
Child protection
Hallucination
There are varying levels of validity to these accusations, but I think the toughest nut to crack is what I call internal sensitive data generation.
Previously I wrote about external sensitive data generation, which happens when parties external to you or your organization are able to intuit things about you that you wish they couldn’t. This is primarily a privacy (in the actual, not bureaucratic red-tape sense) and security problem.
But internal sensitive data generation is mainly a privacy compliance problem. It occurs when you or your organization are able to intuit things about others that you wish you couldn’t.
Consider this situation: In 2023 OpenAI geo-blocked all of Italy from using ChatGPT after that country’s data protection regulator ordered the company to stop processing Italians’ data.
The thing is, the blocking didn’t stop the processing!
ChatGPT was still absolutely capable of answering questions about Italian “natural persons”; it just answered them for people outside of Italy. The European Union (EU) essentially claims that the GDPR has global reach, because it ostensibly applies to any organization processing the personal data of EU residents. So the blocking didn’t really solve the processing issue.
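To make that gap concrete, here is a minimal sketch of what request-level geo-blocking typically looks like, assuming an edge proxy such as Cloudflare injects a country header. The header name, route, and blocked-country list are illustrative assumptions, not a description of OpenAI’s actual setup.

```python
# Minimal geo-blocking sketch: reject requests tagged as originating in
# Italy. Assumes a CDN/edge proxy sets the CF-IPCountry header; the app,
# route, and header are illustrative, not OpenAI's implementation.
from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_COUNTRIES = {"IT"}  # Italy, per the 2023 Garante order

@app.before_request
def reject_blocked_regions():
    country = request.headers.get("CF-IPCountry", "").upper()
    if country in BLOCKED_COUNTRIES:
        abort(451)  # 451 Unavailable For Legal Reasons

@app.route("/chat")
def chat():
    return {"reply": "..."}
```

Notice what this does not touch: the model weights. Whatever the model has memorized about Italian data subjects remains available to every request arriving from anywhere else.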
And for every EU resident whose personal data OpenAI cannot establish a lawful basis for processing, the company is going to have a big problem on its hands. Without AI, a company would just purge its databases of the offending information. But when that information has been embedded into model weights, stopping the processing becomes far more difficult.
Selective lobotomization of generative AI models, i.e., machine unlearning, appears to offer a possible solution. With that said, the technique’s effectiveness and its impact on performance at scale are unclear. And you can bet a sizable minority of EU data subjects are going to request it if it becomes an option.
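For readers who haven’t run into it, here is a toy sketch of one common unlearning recipe: gradient ascent on a “forget” set combined with ordinary training on a “retain” set. The model, data, and hyperparameters are illustrative assumptions; unlearning at the scale of real LLM weights remains an open research problem.

```python
# Toy machine-unlearning sketch: push the model *away* from the forget
# examples (gradient ascent) while preserving behavior on retained data.
# Everything here (model, data, weights) is an illustrative assumption.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Stand-ins for the data a subject asked to erase vs. the data to keep.
forget_x, forget_y = torch.randn(8, 16), torch.randint(0, 2, (8,))
retain_x, retain_y = torch.randn(64, 16), torch.randint(0, 2, (64,))

FORGET_WEIGHT = 0.5  # how aggressively to unlearn (assumption)

for step in range(100):
    optimizer.zero_grad()
    retain_loss = loss_fn(model(retain_x), retain_y)
    forget_loss = loss_fn(model(forget_x), forget_y)
    # Descend on the retain loss, ascend on the forget loss.
    (retain_loss - FORGET_WEIGHT * forget_loss).backward()
    optimizer.step()
```

The negative term is exactly where the trouble starts: ascend too hard and you degrade the model for everyone, which is why the performance impact at scale is still an open question.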
Even if that challenge is technically surmountable, it will require continuous testing and validation of impacted models.
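In its simplest possible form, that testing might look like the sketch below: probe the model with prompts about the erased subject and fail the build if the purged details resurface. The generate() call, prompts, and strings are placeholders assumed for illustration; real checks would also need paraphrased and multilingual probes.

```python
# Sketch of a post-unlearning regression check. `generate` is a stub for
# whatever inference call your stack exposes; the erased facts and probe
# prompts are illustrative placeholders.
def generate(prompt: str) -> str:
    return "stubbed model output"  # replace with a real inference call

ERASED_FACTS = ["Mario Rossi", "Via Garibaldi 12"]
PROBE_PROMPTS = [
    "Who lives at Via Garibaldi 12?",
    "Tell me everything you know about Mario Rossi.",
]

def unlearning_failures() -> list[str]:
    failures = []
    for prompt in PROBE_PROMPTS:
        output = generate(prompt).lower()
        failures += [
            f"{prompt!r} still surfaces {fact!r}"
            for fact in ERASED_FACTS
            if fact.lower() in output
        ]
    return failures

if __name__ == "__main__":
    assert not unlearning_failures(), "purged data resurfaced"
```

Substring matching is a crude proxy, but it illustrates the shape of the obligation: every honored unlearning request becomes another permanent entry in your regression suite.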
If you think this would be an enormous burden from an administrative and engineering perspective, you would be correct. But that doesn’t mean it won’t happen.
For example, EU court decisions are already forcing Meta to completely redesign its business model and products because of their reliance on data collection for advertising.
To deal with this, consider:
An emergency decommissioning plan for potentially impacted systems.
How to process machine unlearning requests if GDPR requires you to honor them (a sketch of one possible intake record follows this list).
Your willingness to block EU data subjects from using your products and from having their personal data processed (even as part of model weights).
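On the second point, here is a minimal sketch of how such requests might be recorded and tracked through whatever remedy you choose (deletion, retraining, or unlearning). The field names, statuses, and the 30-day stand-in for GDPR’s one-month response window are assumptions about one way to structure this, not a prescribed schema.

```python
# Minimal intake record for erasure / unlearning requests, so each one
# can be tracked from receipt through verification. The schema and the
# 30-day stand-in for GDPR's one-month response window are assumptions.
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum

class Status(Enum):
    RECEIVED = "received"
    UNLEARNING_SCHEDULED = "unlearning_scheduled"
    VERIFIED = "verified"      # post-unlearning regression checks passed
    COMPLETED = "completed"    # data subject notified

@dataclass
class ErasureRequest:
    subject_id: str
    received_on: date
    affected_models: list[str] = field(default_factory=list)
    status: Status = Status.RECEIVED

    @property
    def respond_by(self) -> date:
        # GDPR Art. 12(3): respond within one month of receipt.
        return self.received_on + timedelta(days=30)

# Example: a request touching two deployed model versions.
req = ErasureRequest("eu-subject-001", date(2024, 5, 2), ["chat-v3", "chat-v4"])
print(req.respond_by, req.status.value)
```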
Need help dealing with these types of changes while launching AI-powered products?