

4 things to do about sensitive data generation by AI systems
The internet definitely knows you are a dog.
Check out the YouTube, Spotify, and Apple Podcast versions.
The most (potentially over-)hyped security risk with LLMs is unintended training. While you shouldn't dismiss it, I think something else is going to be a much bigger problem: sensitive data generation (SDG).
SDG happens when a person or organization makes data available (intentionally or unwittingly) to an AI model, where the provider would not consider that data sensitive on its own (unlike unintended training, where the data provided is itself sensitive). But when the model aggregates it with other pieces of information provided in different contexts, it can produce outputs the original provider would consider sensitive.
For example, according to the paper “Beyond Memorization: Violating Privacy via Inference with Large Language Models,” the authors found:
current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to 85% top-1 and 95.8% top-3 accuracy at a fraction of the cost (100×) and time (240×) required by humans.
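To make the aggregation step concrete, here is a minimal sketch of the kind of prompt an adversary could assemble from scattered, individually harmless comments. The "hook turn" snippet echoes the paper's Melbourne example; the other snippets and the prompt wording are my own illustrative assumptions, and the model call itself is left to whichever API you use:

```python
# Minimal sketch of the aggregation step behind SDG: each snippet is
# harmless alone, but stitched into one prompt it lets an LLM infer a
# sensitive attribute (here, location). Only the prompt is built here;
# send it to whatever chat-completion API you use.

SNIPPETS = [
    # Individually innocuous comments collected from different posts.
    "there is this nasty intersection on my commute, I always get stuck "
    "there waiting for a hook turn",
    "the flat whites near my office are the best part of my morning",
    "still can't believe how long the stadium queues were last weekend",
]

def build_inference_prompt(snippets: list[str]) -> str:
    """Combine scattered comments into a single attribute-inference prompt."""
    joined = "\n".join(f"- {s}" for s in snippets)
    return (
        "All of the comments below were written by the same author. "
        "Infer the author's most likely city and explain your reasoning:\n"
        + joined
    )

if __name__ == "__main__":
    print(build_inference_prompt(SNIPPETS))
```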
This infographic demonstrates exactly how such aggregation can happen, using an example from a Reddit post:

So I think we can declare the below meme effectively dead (if it weren’t already), because the internet definitely knows a lot more about you than your species:

On top of the obvious implications here for GDPR compliance (which will drive demand for solutions like machine unlearning) and privacy more broadly, there are other risks.
For example, given enough context and tidbits of publicly available information, LLMs can likely reconstruct through inference:
Unannounced but planned key executive changes
Material non-public information (MNPI)
Trade secrets
What to do about sensitive data generation
In the near term, make sure your AI governance program is addressing these challenges, incorporating SDG considerations into your policy (grab your free template here). Just alerting employees to the phenomenon may help to mitigate some of the immediate risk.
Focus most of your effort on mitigating SDG risk as it relates to heavily regulated data like personal information and potentially MNPI. The latter will be much harder to control due to its lack of a formal definition and the essentially infinite forms it can take. Eventually, companies and regulators will need to rethink this concept entirely.
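Where SDG touches regulated personal data, screening model outputs before release catches the most obvious leaks. Below is a minimal sketch, assuming a regex-based screen; the patterns are illustrative, and a production deployment would lean on a dedicated PII-detection or DLP service:

```python
import re

# Minimal sketch of an output screen for regulated personal data. The
# patterns below are illustrative assumptions, not a complete PII
# taxonomy; detectors can and do overlap on the same string.

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{8,}\d\b"),
}

def flag_personal_data(model_output: str) -> list[str]:
    """Return the categories of personal data detected in a model output."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(model_output)]

def release_or_hold(model_output: str) -> str:
    """Pass clean outputs through; hold flagged ones for human review."""
    hits = flag_personal_data(model_output)
    if hits:
        return f"[held for review: possible {', '.join(hits)}]"
    return model_output

if __name__ == "__main__":
    print(release_or_hold("Their SSN appears to be 123-45-6789."))
```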
Be prepared for hugely impactful regulatory decisions, most likely coming from the European Union (EU), about privacy as it relates to AI model development. For example, EU court decisions are already forcing Meta to completely redesign its business model and products due to its reliance on data collection for advertising. To deal with this, consider:
An emergency decommissioning plan for any potentially impacted systems.
Your willingness to block EU data subjects from using your products (a minimal sketch follows this list).
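For the second point, here is a minimal sketch of what a geo-block could look like, assuming an upstream CDN or proxy stamps each request with a country header (Cloudflare's CF-IPCountry is one real example of such a header):

```python
# Minimal sketch of a geo-block for EU data subjects, assuming an
# upstream CDN or proxy stamps each request with a country header.
# Whether to block at all, and whether to extend the set to the full
# EEA (adding IS, LI, NO), is a legal decision, not a technical one.

EU_COUNTRIES = {
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE",
    "GR", "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT",
    "RO", "SK", "SI", "ES", "SE",
}

def should_block(headers: dict[str, str]) -> bool:
    """Return True if the request appears to originate in an EU member state."""
    return headers.get("CF-IPCountry", "").upper() in EU_COUNTRIES

if __name__ == "__main__":
    # A request tagged as coming from France would be refused, e.g.,
    # with HTTP 451 (Unavailable For Legal Reasons).
    print(should_block({"CF-IPCountry": "FR"}))  # True
```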
Longer-term, simply accept that trade secrets and other proprietary information will have a shorter “half-life” of competitive advantage due to the combination of AI and ubiquitous posting on social media.
Yes. All of this is going to be disruptive.
But you need to deal with it.
Want to learn more about how to prepare your company for AI-powered transformation?