I was speaking with a healthcare company about processing protected health information (PHI) with OpenAI’s application programming interface (API). I recommended that, if they do this, they should ask for Zero Data Retention (ZDR) and make getting it a condition of using the tool.
ZDR means just that: OpenAI doesn’t retain any prompt or response on their end (as opposed to the default of keeping them for 30 days, or longer with a legal hold). They just let the model respond via API call back to you, and that’s it. They don’t, however, grant ZDR lightly and I have heard from clients that getting it is an involved process.
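To be clear, ZDR doesn’t change anything about how you call the API; it’s an account-level agreement on OpenAI’s side, not a request parameter. Here is a minimal sketch of the kind of call involved (the model name and prompt are placeholders of my own, not anything from the conversation above):

```python
# Minimal sketch of an OpenAI API call. With ZDR in place, the prompt and
# response below would not be retained on OpenAI's side after the call
# completes; nothing in the request itself toggles ZDR.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": "Summarize this de-identified clinical note: ..."}
    ],
)

print(response.choices[0].message.content)
```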
With that said, it’s probably worthwhile. I think there are only a few circumstances where there would be some risk of retention, in descending order of likelihood:
1. An attacker intercepts prompts and completions as they happen
This would be pretty tricky and would require persistent access, but it’s feasible.
While OpenAI itself still wouldn’t be retaining anything, the attacker would be siphoning off the data to its own storage.
The attacker would need to avoid detection for a long time and avoid getting accidentally kicked out when a vulnerability is patched, a connection is interrupted, etc.
Such an attacker would also likely not be motivated by money: this level and persistence of access would likely mean it could do other, more profitable things, like deploying ransomware, but chose not to because siphoning the data was more important.
So this would likely be a governmental actor (or one backed by a nation-state).
2. There is an architectural flaw that makes it not really ZDR
A bug in the architecture that causes data to be retained by accident is also a possibility. This would make it possible for an attacker (or a litigant through discovery) to retrieve it.
People - even smart ones - make mistakes. And the leakage of chat summaries and payment information between ChatGPT users last year shows this isn’t impossible.
It seems like the storage volumes required to keep this data, however, would quickly draw someone’s attention. If data for one or more customers started incurring unexpected cloud storage costs, I feel like someone would notice pretty quickly.
3. OpenAI is lying about ZDR
If OpenAI were lying about ZDR, on the other hand, I think this would be a reputational death sentence (not to mention a massive litigation risk).
So this is extremely unlikely.
The most conceivable situation in which this would occur is if OpenAI were mandated to lie through a gag order because:
4. The U.S. government requires OpenAI to build in a backdoor
Under the Communications Assistance for Law Enforcement Act (CALEA), service providers do need to ensure law enforcement officers have a way to conduct electronic surveillance.
So, if the government:
received a warrant from a judge in a criminal case
got a Foreign Intelligence Surveillance Act (FISA) order
targeted a non-US person abroad using Section 702 of FISA
it could conceivably require OpenAI to create a bespoke system for retaining customer data.
Again, while possible, this seems highly unlikely.
It is far-fetched to think that the key information the government might need would be contained in prompts and responses from an enterprise client of OpenAI.
And it’s also optimistic (or pessimistic?) to think the government has the technical ability to sift through what would be insane volumes of data.
Even if such a backdoor existed, there have been abuses of surveillance authorities before, and the risk of the government’s repository of surveillance data getting hacked is real. But I consider the likelihood of either of these occurring to be negligible for any individual company.
ZDR in other services
If for some reason you still have concerns about OpenAI’s ZDR, you can get similar functionality from different companies.
With that said, the same threat vectors apply to all of these offerings as they do to OpenAI.1
Microsoft’s Azure OpenAI Service
If you successfully get an exemption from abuse monitoring, neither your prompts nor your completions will be stored.
There was a bit of a fracas earlier this year when some law firms apparently didn’t realize their data was being retained and potentially reviewed by Microsoft for abuse monitoring. So clearly they hadn’t gotten an exemption. But I feel like Microsoft was quite explicit about how this worked, and I was confused as to how this became a news story.
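For reference, the call pattern is nearly identical to OpenAI’s; the abuse monitoring exemption applies at the resource level, not per request. A rough sketch, with the endpoint, API version, and deployment name as placeholder assumptions of mine:

```python
# Sketch of an Azure OpenAI call. Whether prompts and completions are stored
# for abuse monitoring depends on the exemption status of your Azure resource,
# not on anything in this request.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://example-resource.openai.azure.com",  # placeholder endpoint
    api_key="<your-key>",
    api_version="2024-02-01",  # placeholder API version
)

response = client.chat.completions.create(
    model="my-gpt4o-deployment",  # placeholder Azure deployment name
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)
```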
Amazon Web Services (AWS) Bedrock
By default, Amazon’s Bedrock generative AI service does not retain prompts or responses. It applies automated abuse monitoring in-stride, but does not retain the underlying data. It does retain the results (pass/fail) of the abuse classification, though, likely to identify higher risk accounts.
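As an illustration, here is roughly what a Bedrock call looks like using boto3’s Converse API. The region and model ID are placeholders, and the no-retention behavior is a property of the service rather than anything in the code:

```python
# Sketch of an Amazon Bedrock call via the Converse API. Bedrock does not
# retain the prompt or response by default; per the discussion above, only
# the pass/fail result of its automated abuse classification is kept.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder region

response = bedrock.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)

print(response["output"]["message"]["content"][0]["text"])
```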
Hosting a model using Infrastructure-as-a-Service (IaaS)
Many organizations view operating open source or home-grown models hosted using IaaS as the most secure option. From my understanding, this perception is due to concerns about data retention and handling by Software-as-a-Service (SaaS) providers.
At least in terms of the risks described above, though, this approach is no different.
Separately, your risk of unintended training is lower, but that of security misconfiguration is higher.
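To make the comparison concrete, self-hosting often just means pointing the same kind of client at your own endpoint. A sketch assuming a server that exposes an OpenAI-compatible API (for example, vLLM), with the URL and model name as placeholders:

```python
# Sketch of calling a self-hosted model through an OpenAI-compatible endpoint
# (e.g., one served by vLLM). The data never leaves your infrastructure, but
# logging, storage, and access controls are now entirely your responsibility.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: your own inference server
    api_key="not-needed",                 # many self-hosted servers ignore this value
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder open-weights model
    messages=[{"role": "user", "content": "Hello"}],
)

print(response.choices[0].message.content)
```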
Bottom line
OpenAI’s ZDR seems like a secure option for companies that want to use generative AI in a SaaS deployment model. While nothing is risk-free, this seems like a pretty good approach.
Are you considering how to build your AI infrastructure? Need help making decisions related to:
Cybersecurity?
Compliance?
Privacy?
StackAware can help.
The only exception would be if you hosted a model truly on-premises. In that case you would have to know if the government wanted to build a backdoor - because you would need to build it for them. But you wouldn’t be able to do or say anything about it anyway.