I was speaking with a healthcare company about processing protected health information (PHI) with OpenAI’s application programming interface (API). I recommended that, if they do this, they should ask for Zero Data Retention (ZDR) and make getting it a condition of using the tool.
ZDR means just that: OpenAI doesn’t retain any prompt or response on their end (as opposed to the default of keeping them for 30 days, or longer with a legal hold). They just let the model respond via API call back to you, and that’s it. They don’t, however, grant ZDR lightly and I have heard from clients that getting it is an involved process.
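To be clear, ZDR doesn’t change anything about how you call the API; it’s an account-level agreement on OpenAI’s side, not a request parameter. Here is a minimal sketch of the kind of call involved (the model name and prompt are placeholders of my own, not anything from the conversation above):

```python
# Minimal sketch of an OpenAI API call. With ZDR in place, the prompt and
# response below would not be retained on OpenAI's side after the call
# completes; nothing in the request itself toggles ZDR.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": "Summarize this de-identified clinical note: ..."}
    ],
)

print(response.choices[0].message.content)
```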
With that said, it’s probably worthwhile. I think there are only a few circumstances where there would be some risk of retention, in descending order of likelihood:
1. An attacker intercepts prompts and completions as they happen
This would be pretty tricky and would require persistent access, but it’s feasible.
While OpenAI itself still wouldn’t be retaining anything, the attacker would be siphoning off the data to its own storage.
The attacker would need to avoid detection for a long time and avoid getting accidentally kicked out when a vulnerability is patched, a connection is interrupted, etc.
Such an attacker would also likely not be motivated by money: this level and persistence of access would likely mean it could do other, more profitable things, like deploying ransomware, but chose not to because siphoning the data was more important.
So this would likely be a governmental actor (or one backed by a nation-state).
2. There is an architectural flaw that makes it not really ZDR
A bug in the architecture that causes data to be retained by accident is also a possibility. This would make it possible for an attacker (or a litigant through discovery) to retrieve it.
People - even smart ones - make mistakes. And the leakage of chat summaries and payment information between ChatGPT users last year shows this isn’t impossible.
It seems like the storage volumes required to keep this data, however, would quickly draw someone’s attention. If data for one or more customers started incurring unexpected cloud storage costs, I feel like someone would notice pretty quickly.
3. OpenAI is lying about ZDR
If OpenAI were lying about ZDR, on the other hand, I think this would be a reputational death sentence (not to mention a massive litigation risk).
So this is extremely unlikely.
The most conceivable situation in which this would occur is if OpenAI were mandated to lie through a gag order because:
4. The U.S. government requires OpenAI to build in a backdoor
Under the Communications Assistance for Law Enforcement Act (CALEA), service providers do need to ensure law enforcement officers have a way to conduct electronic surveillance.
So, if the government:
received a warrant from a judge in a criminal case
got a Foreign Intelligence Surveillance Act (FISA) order
targeted a non-US person abroad using Section 702 of FISA
it could conceivably require OpenAI to create a bespoke system for retaining customer data.
Again, while possible, this seems highly unlikely.
It is far-fetched to think that the key information the government might need would be contained in prompts and responses from an enterprise client of OpenAI.
And it’s also optimistic (or pessimistic?) to think the government has the technical ability to sift through what would be insane volumes of data.
Even if such a backdoor existed, there have been abuses of surveillance authorities before, and the risk of the government’s repository of surveillance data getting hacked is real. But I consider the likelihood of either of these occurring to be negligible for any individual company.
ZDR in other services
If for some reason you still have concerns about OpenAI’s ZDR, you can get similar functionality from different companies.
With that said, the same threat vectors apply to all of these offerings as they do to OpenAI.1
Microsoft’s Azure OpenAI Service
If you successfully get an exemption from abuse monitoring, neither your prompts nor your completions will be stored.
There was a bit of a fracas earlier this year when some law firms apparently didn’t realize their data was being retained and potentially reviewed by Microsoft for abuse monitoring. So clearly they hadn’t gotten an exemption. But I feel like Microsoft was quite explicit about how this worked, and I was confused as to how this became a news story.
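For reference, the call pattern is nearly identical to OpenAI’s; the abuse monitoring exemption applies at the resource level, not per request. A rough sketch, with the endpoint, API version, and deployment name as placeholder assumptions of mine:

```python
# Sketch of an Azure OpenAI call. Whether prompts and completions are stored
# for abuse monitoring depends on the exemption status of your Azure resource,
# not on anything in this request.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example-resource.openai.azure.com",  # placeholder endpoint
    api_key="<your-key>",
    api_version="2024-02-01",  # placeholder API version
)

response = client.chat.completions.create(
    model="my-gpt4o-deployment",  # placeholder Azure deployment name
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)
```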
Amazon Web Services (AWS) Bedrock
By default, Amazon’s Bedrock generative AI service does not retain prompts or responses. It applies automated abuse monitoring in-stride, but does not retain the underlying data. It does retain the results (pass/fail) of the abuse classification, though, likely to identify higher risk accounts.
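As an illustration, here is roughly what a Bedrock call looks like using boto3’s Converse API. The region and model ID are placeholders, and the no-retention behavior is a property of the service rather than anything in the code:

```python
# Sketch of an Amazon Bedrock call via the Converse API. Bedrock does not
# retain the prompt or response by default; per the discussion above, only
# the pass/fail result of its automated abuse classification is kept.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder region

response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```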
Hosting a model using Infrastructure-as-a-Service (IaaS)
Many organizations view operating open source or home-grown models hosted using IaaS as the most secure option. From my understanding, this perception is due to concerns about data retention and handling by Software-as-a-Service (SaaS) providers.
At least in terms of the risks described above, though, this approach is no different.
Separately, your risk of unintended training is lower, but that of security misconfiguration is higher.
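To make the comparison concrete, self-hosting often just means pointing the same kind of client at your own endpoint. A sketch assuming a server that exposes an OpenAI-compatible API (for example, vLLM), with the URL and model name as placeholders:

```python
# Sketch of calling a self-hosted model through an OpenAI-compatible endpoint
# (e.g., one served by vLLM). The data never leaves your infrastructure, but
# logging, storage, and access controls are now entirely your responsibility.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: your own inference server
    api_key="not-needed",                 # many self-hosted servers ignore this value
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder open-weights model
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)
```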
Bottom line
OpenAI’s ZDR seems like a secure option for companies that want to use generative AI in a SaaS deployment model. While nothing is risk-free, this seems like a pretty good approach.
Are you considering how to build your AI infrastructure? Need help making decisions related to:
Cybersecurity?
Compliance?
Privacy?
StackAware can help.
The only exception would be if you hosted a model truly on-premises. In that case you would have to know if the government wanted to build a backdoor - because you would need to build it for them. But you wouldn’t be able to do or say anything about it anyway.