A 3-level framework for security leaders to assess -aaS GenAI security
Training and data retention considerations when using SaaS and PaaS.
Check out this whiteboard session on YouTube.
Amazon lost $1,401,573 from ChatGPT data leakage.
Why?
Its employees didn’t realize an as-a-Service (-aaS) generative AI provider was using their (confidential) data for model training.
So here’s a 3-level framework for understanding how this works and addressing the security and compliance risks:
Level 1: Sending just system + user prompt
In the simplest case, companies like:
OpenAI
Microsoft
Anthropic
have an application programming interface (API) to which you can send system and user prompts.
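For example, here is a minimal sketch of a Level 1 call using the OpenAI Python SDK (the model name and prompts are placeholders): only what you put in those two messages, plus the response, ever leaves your environment.

```python
# Level 1 sketch: only a system prompt and a user prompt go to the provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a contract-review assistant."},
        {"role": "user", "content": "Summarize the termination clause in plain English."},
    ],
)

print(response.choices[0].message.content)
```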
The biggest issues?
Training: is the model improved based on what you send it? If so, this could cause data leakage. But none of the companies I mentioned train base models on your content by default.
Retention: how long does the provider keep your prompts? Even if it isn’t training on them, having your prompts and responses sit on someone else’s servers is a risk by itself. Explore zero data retention (ZDR) if your use case (healthcare, financial services) justifies it.
Level 2: Retrieval-augmented generation (RAG)
You can use RAG to “ground” generative AI model responses in proprietary data. This improves accuracy, but you should consider:
Are you providing context on-the-fly via LangChain?
Or is the provider storing it (OpenAI Assistants)?
In the first case, LangChain searches whichever repository you keep the data in, sending only the needed context (and prompts) to the model provider.
In the second, however, you are storing ALL context data with a vendor, so vet it appropriately. And OpenAI doesn't offer ZDR for queries to Assistants.
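Here is a hedged sketch of the on-the-fly pattern. The toy keyword retriever stands in for LangChain or whatever repository you actually host, and the model name is a placeholder; the point is that only the few retrieved chunks accompany the prompt to the provider.

```python
# On-the-fly RAG sketch: retrieval runs against data you control, and only
# the relevant chunks are sent to the model provider.
from openai import OpenAI

client = OpenAI()

# Stand-in for your own document store (vector DB, search index, wiki export, etc.).
DOCUMENTS = [
    "Support tickets are retained for 24 months, then purged automatically.",
    "The VPN must be used when accessing production systems remotely.",
    "Expense reports over $500 require VP approval.",
]

def retrieve_context(question: str, k: int = 2) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the question.
    In practice this would be a vector or keyword search you host."""
    terms = set(question.lower().split())
    return sorted(DOCUMENTS, key=lambda d: -len(terms & set(d.lower().split())))[:k]

question = "How long do we keep customer support tickets?"
context = "\n".join(retrieve_context(question))

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```

The key security property: the full document store never leaves infrastructure you control.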
In the case of M365 Copilot, your context data is already stored with the model provider (Microsoft is both the data host and the model provider), simplifying the analysis.
Level 3: Fine-tuning proprietary models
If you need to replicate a certain style or tone of responses, fine-tuning might be the way to go.
In this case you can tweak an existing proprietary model with your own data.
The model provider will store both the fine-tuning data and the resulting model. So training and retention policies are key.
AWS says Bedrock doesn’t train base models on fine-tuning data.
OpenAI retains fine-tuning data until manual deletion.
Inactive fine-tuned model deployments in Azure auto-delete after 15 days.
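To make the retention point concrete, here is a hedged sketch of an OpenAI fine-tuning workflow with explicit cleanup. The file name and base model are placeholders, and you should confirm with the provider that deleting the uploaded file satisfies your retention requirements.

```python
# Fine-tuning sketch with explicit cleanup of the uploaded training data.
import time
from openai import OpenAI

client = OpenAI()

# 1. Upload the training data (a JSONL file of example prompts and responses).
training_file = client.files.create(
    file=open("tone_examples.jsonl", "rb"),  # placeholder file name
    purpose="fine-tune",
)

# 2. Start the fine-tuning job against a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
)

# 3. Wait for the job to finish.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

# 4. Delete the uploaded training data so it doesn't sit on the provider's
#    servers indefinitely; otherwise it is retained until you remove it.
client.files.delete(training_file.id)
print(f"Job {job.id} finished with status {job.status}; training file deleted.")
```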
You can combine fine-tuning with RAG. So the security considerations stack on top of each other.
Need help understanding AI-aaS?
Deciding how to host your AI deployments requires expertise and an understanding of your business case. StackAware already has the first and will build the second as part of an AI risk assessment.
After understanding the key challenges from a:
Cybersecurity
Compliance
Privacy
perspective, we can design a customized AI governance program and even get you ready for ISO 42001 certification.
Ready to learn more?