

4 things to do about sensitive data generation by AI systems
The internet definitely knows you are a dog.
Check out the YouTube, Spotify, and Apple Podcast versions.
The most (potentially over-)hyped security risk with LLMs is unintended training. While you shouldn't dismiss it, I think something else is going to be a much bigger problem: sensitive data generation (SDG).
SDG happens when a person or organization makes data available (intentionally or unwittingly) to an AI model, where the provider would not consider that data sensitive on its own (unlike unintended training, where the data provided is itself sensitive). But when the model aggregates it with other pieces of information provided in different contexts, it can produce outputs the original provider would consider sensitive.
For example, according to the paper “Beyond Memorization: Violating Privacy via Inference with Large Language Models,” the authors found:
current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to 85% top-1 and 95.8% top-3 accuracy at a fraction of the cost (100×) and time (240×) required by humans.
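To make the aggregation step concrete, here is a minimal sketch of the kind of prompt an adversary could assemble from scattered, individually harmless comments. The "hook turn" snippet echoes the paper's Melbourne example; the other snippets and the prompt wording are my own illustrative assumptions, and the model call itself is left to whichever API you use:

```python
# Minimal sketch of the aggregation step behind SDG: each snippet is
# harmless alone, but stitched into one prompt it lets an LLM infer a
# sensitive attribute (here, location). Only the prompt is built here;
# send it to whatever chat-completion API you use.

SNIPPETS = [
    # Individually innocuous comments collected from different posts.
    "there is this nasty intersection on my commute, I always get stuck "
    "there waiting for a hook turn",
    "the flat whites near my office are the best part of my morning",
    "still can't believe how long the stadium queues were last weekend",
]

def build_inference_prompt(snippets: list[str]) -> str:
    """Combine scattered comments into a single attribute-inference prompt."""
    joined = "\n".join(f"- {s}" for s in snippets)
    return (
        "All of the comments below were written by the same author. "
        "Infer the author's most likely city and explain your reasoning:\n"
        + joined
    )

if __name__ == "__main__":
    print(build_inference_prompt(SNIPPETS))
```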
This infographic demonstrates exactly how such aggregation can happen, using an example from a Reddit post:

So I think we can declare the below meme effectively dead (if it weren’t already), because the internet definitely knows a lot more about you than your species:

On top of the obvious implications here for GDPR compliance (which will drive demand for solutions like machine unlearning) and privacy more broadly, there are other risks.
For example, given enough context and tidbits of publicly available information, LLMs can likely reconstruct through inference:
Unannounced but planned key executive changes
Material non-public information (MNPI)
Trade secrets
What to do about sensitive data generation
In the near term, make sure your AI governance program is addressing these challenges, incorporating SDG considerations into your policy (grab your free template here). Just alerting employees to the phenomenon may help to mitigate some of the immediate risk.
Focus most of your effort on mitigating SDG risk as it relates to heavily regulated data like personal information and potentially MNPI. The latter will be much harder to control due to its lack of a formal definition and the essentially infinite forms it can take. Eventually, companies and regulators will need to rethink this concept entirely.
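Where SDG touches regulated personal data, screening model outputs before release catches the most obvious leaks. Below is a minimal sketch, assuming a regex-based screen; the patterns are illustrative, and a production deployment would lean on a dedicated PII-detection or DLP service:

```python
import re

# Minimal sketch of an output screen for regulated personal data. The
# patterns below are illustrative assumptions, not a complete PII
# taxonomy; detectors can and do overlap on the same string.

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{8,}\d\b"),
}

def flag_personal_data(model_output: str) -> list[str]:
    """Return the categories of personal data detected in a model output."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(model_output)]

def release_or_hold(model_output: str) -> str:
    """Pass clean outputs through; hold flagged ones for human review."""
    hits = flag_personal_data(model_output)
    if hits:
        return f"[held for review: possible {', '.join(hits)}]"
    return model_output

if __name__ == "__main__":
    print(release_or_hold("Their SSN appears to be 123-45-6789."))
```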
Be prepared for hugely impactful regulatory decisions, most likely coming from the European Union (EU), about privacy as it relates to AI model development. For example, EU court decisions are already forcing Meta to completely redesign its business model and products due to its reliance on data collection for advertising. To deal with this, consider:
An emergency decommissioning plan for any potentially impacted systems.
Your willingness to block EU data subjects from using your products (a minimal sketch follows this list).
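For the second point, here is a minimal sketch of what a geo-block could look like, assuming an upstream CDN or proxy stamps each request with a country header (Cloudflare's CF-IPCountry is one real example of such a header):

```python
# Minimal sketch of a geo-block for EU data subjects, assuming an
# upstream CDN or proxy stamps each request with a country header.
# Whether to block at all, and whether to extend the set to the full
# EEA (adding IS, LI, NO), is a legal decision, not a technical one.

EU_COUNTRIES = {
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE",
    "GR", "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT",
    "RO", "SK", "SI", "ES", "SE",
}

def should_block(headers: dict[str, str]) -> bool:
    """Return True if the request appears to originate in an EU member state."""
    return headers.get("CF-IPCountry", "").upper() in EU_COUNTRIES

if __name__ == "__main__":
    # A request tagged as coming from France would be refused, e.g.,
    # with HTTP 451 (Unavailable For Legal Reasons).
    print(should_block({"CF-IPCountry": "FR"}))  # True
```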
Longer-term, simply accept that trade secrets and other proprietary information will have a shorter “half-life” of competitive advantage due to the combination of AI and ubiquitous posting on social media.
Yes. All of this is going to be disruptive.
But you need to deal with it.
Want to learn more about how to prepare your company for AI-powered transformation?