Clarifying data security concerns with third-party-hosted AI models
A consensus appears to have developed that running artificial intelligence (AI) models yourself is necessarily more secure than having a third party do it for you.
I’m not so sure.
All other things being equal, Software-as-a-Service (SaaS) is the most secure deployment model. Check out this article for detailed reasons why, but to summarize, when using SaaS you get:
Better security operations team familiarity with the software
Faster vulnerability remediation
Lower chance of misconfiguration
Additionally, many proponents of self-hosting AI models as a security measure seem to suggest that entrusting another organization with your sensitive data is always the wrong call. Since almost every company these days already does this (with some notable exceptions) - at the very least, most organizations leverage Infrastructure-as-a-Service (IaaS) offerings from hyperscale cloud providers - this strikes me as an odd position.
Defining unintended training
With that said, AI adds a wrinkle to the equation due to a phenomenon I call “unintended training.”
I define unintended training as any case where an AI model trains on information that the provider of that information - in retrospect - would not want the model trained on. This is different from sensitive data aggregation: with unintended training, the data trained on is itself sensitive, whereas with aggregation the individual pieces are not sensitive but the end result is. Examples include systems trained on:
Material non-public financial information
Personally identifiable information
where people not authorized to see the underlying information can nonetheless interact with the model and have it reproduce the data in question.
Should a model be unintentionally trained on this data, it could conceivably regurgitate it - or something similar - to a party who shouldn’t have access.
There are reported victims of this phenomenon.
Advantages of self-hosting
It’s not entirely clear how big a risk this vector represents, but since we are in the early days of generative AI, it is worth taking seriously. When analyzing unintended training through a security lens, hosting a model yourself has two primary advantages:
Controlling who else can use the model, whether it was trained intentionally or otherwise.
Being able to “roll back” the model to a previous version if any unintended training that does occur is serious enough to warrant doing so.
With most commercially available SaaS options (e.g. GPT-4), neither of these is possible, although #1 is an option with some Microsoft offerings.
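To make the two self-hosting advantages concrete, here is a minimal sketch of what they look like operationally: an allowlist controlling who can query the model, and versioned checkpoints that let you roll back after discovering unintended training. All the names here (`ModelRegistry`, the checkpoint strings, the principals) are hypothetical illustrations, not any real product's API.

```python
class ModelRegistry:
    """Hypothetical registry for a self-hosted model, illustrating
    (1) query access control and (2) rollback to an earlier version."""

    def __init__(self):
        self._checkpoints = []   # ordered model versions, oldest first
        self._allowlist = set()  # principals permitted to query the model

    def grant(self, principal):
        """Authorize a principal to query the model (advantage #1)."""
        self._allowlist.add(principal)

    def save_checkpoint(self, weights):
        """Snapshot the model before each training run."""
        self._checkpoints.append(weights)

    def rollback(self):
        """Discard the latest version, e.g. after discovering it was
        trained on data it should not have seen (advantage #2)."""
        if len(self._checkpoints) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._checkpoints.pop()

    def current(self, principal):
        """Return the live model version, enforcing the allowlist."""
        if principal not in self._allowlist:
            raise PermissionError(f"{principal} may not query this model")
        return self._checkpoints[-1]
```

The point of the sketch is that both controls live entirely on your side of the trust boundary when you self-host; with a third-party SaaS model, you depend on the provider to offer equivalents.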
Additionally, unintended training risk is not zero even if you host a model yourself. For example, a healthcare company might unintentionally train a self-hosted large language model (LLM) on protected health information (PHI). Even if the company doesn’t make the model publicly accessible, this could pose a major problem if an otherwise authorized contractor who has not signed a business associate agreement (BAA) gets access to the model.
Aside from the two advantages of self-hosting I mentioned, however, the SaaS vs. IaaS analysis is identical to any other situation. From a purely cybersecurity perspective[1], SaaS is the better way to go (again, ceteris paribus). And as I have noted before, there are ways to mitigate the risk of unintended training when using SaaS. In any case, you’ll need to conduct a systematic risk analysis to make the right call on which approach is best for you.
Since precision of language is important, though, I thought it made sense to clearly label the specific issue to which people are likely referring when they vaguely describe “data security” concerns related to SaaS AI models.
Unintended training is a risk that appears to be more serious when using an AI model operated by a third party, but it is just one of many information security (and business) challenges organizations need to take into account when making build vs. buy decisions.
[1] By “purely cybersecurity perspective,” I am referring only to defending the confidentiality, integrity, or availability of data from malicious impact. As I mentioned, there are always multiple factors at play when making these decisions. And for those getting ready to say “but a contractual/regulatory/other legal requirement says I can’t use a SaaS-based model!”: I’m not a lawyer, but I think it’s sound to advise you to follow those requirements (assuming they actually exist and aren’t just made up, which I believe most such alleged requirements to be, outside some narrow exceptions).