In June 2024, ChatGPT, Claude, Perplexity, and maybe Google Gemini all went down at roughly the same time. Here are 3 lessons:
1. Have failover plans
The more generative AI becomes integrated into enterprise workflows, the greater the damage an outage can cause.
Make sure (through documentation and training) people know how to do their jobs temporarily without generative AI tools.
On a positive note, it doesn’t look like the OpenAI application programming interface (API) went down. Thus enterprise use cases (especially automated ones) were less heavily impacted.
But this won’t always be the case.
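For automated, API-driven workflows, the failover plan can live in code as well as in documentation. Below is a minimal sketch of what that routing logic might look like; the provider wrappers and the manual fallback are placeholders I am assuming for illustration, not calls to any vendor’s actual SDK.

```python
# Minimal failover sketch. The provider functions are hypothetical placeholders,
# each taking a prompt and returning a completion string (raising on failure).

def ask_primary(prompt: str) -> str:
    """Call your primary generative AI provider (placeholder)."""
    raise NotImplementedError

def ask_secondary(prompt: str) -> str:
    """Call a backup provider with a comparable model (placeholder)."""
    raise NotImplementedError

def manual_fallback(prompt: str) -> str:
    """Route the request to a documented manual process instead of an AI tool."""
    return f"QUEUED FOR HUMAN REVIEW: {prompt}"

def complete_with_failover(prompt: str) -> str:
    """Try providers in order; fall back to the manual process if all are down."""
    for provider in (ask_primary, ask_secondary):
        try:
            return provider(prompt)
        except Exception:
            continue  # provider outage or error; try the next option
    return manual_fallback(prompt)
```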
So you’ll want to get back to normal quickly, which is why you should:
2. Consider SLAs for generative AI
OpenAI, Anthropic, and Perplexity don’t offer publicly available service level agreements (SLAs). This means you would have to negotiate these individually for your company.
SLAs align incentives between customer and vendor by requiring the latter to compensate the former if its service degrades. As I’ve written before, you can consider SLAs for all 3 data attributes:
Availability
The most common SLA available.
Major cloud providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud all offer credits to users if their services do not meet certain availability requirements. At least for now, this would represent a clear advantage of operating your own AI model on one of these Infrastructure-as-a-Service (IaaS) offerings.
The pure Software-as-a-Service (SaaS) AI tools mentioned above don’t offer such standardized guarantees.
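To make the credit mechanism concrete, here is a rough sketch of how an availability credit calculation can work. The uptime thresholds and credit percentages are hypothetical assumptions, not any specific vendor’s published terms; check the actual SLA of whichever service you use.

```python
# Sketch of an availability SLA credit check. The uptime thresholds and credit
# percentages below are hypothetical, not any specific vendor's published terms.

HYPOTHETICAL_CREDIT_TIERS = [
    (99.9, 0),    # at or above 99.9% uptime: no credit owed
    (99.0, 10),   # below 99.9% but at least 99.0%: 10% credit
    (95.0, 25),   # below 99.0% but at least 95.0%: 25% credit
    (0.0, 100),   # below 95.0%: 100% credit
]

def monthly_uptime_percent(downtime_minutes: float,
                           minutes_in_month: float = 43_200) -> float:
    """Uptime percentage for the billing month (43,200 minutes ≈ 30 days)."""
    return 100 * (1 - downtime_minutes / minutes_in_month)

def credit_percent(uptime: float) -> int:
    """Return the service credit owed, as a percentage of the monthly bill."""
    for threshold, credit in HYPOTHETICAL_CREDIT_TIERS:
        if uptime >= threshold:
            return credit
    return 100

# Example: a 90-minute outage in a 30-day month
uptime = monthly_uptime_percent(downtime_minutes=90)
print(f"{uptime:.3f}% uptime -> {credit_percent(uptime)}% credit")
```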
Confidentiality
While not necessarily specific to AI, confidentiality SLAs are ultimately the most effective way to align incentives when it comes to data security. If a SaaS vendor needs to pay or credit you a certain amount if they get breached and your information is exposed, they will focus on managing that risk in the most cost-effective way.
Unfortunately, I have yet to see this type of arrangement in real life.
Integrity
The most interesting type of SLA when it comes to AI.
Although I haven’t yet seen this in practice either, it could be incredibly useful for customers of non-deterministic generative AI systems. Model drift, like that which has reportedly occurred with GPT-4, reduces customer confidence. It can also break workflows that rely on certain formats of outputs.
One solution to this problem?
Create a benchmark of prompts for which the vendor will guarantee certain responses. An easy example:
You are a simple calculator. You will respond to all arithmetic questions with only the numerical answer. My question: what is 2 + 2?
If the model responds with anything other than “4,” this would result in the customer receiving a credit from the vendor.
Obviously this type of system would require a lot of detailed planning and negotiation to implement, but it’s certainly feasible.
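If you did negotiate such an arrangement, verifying it could be largely automated. The sketch below assumes a hypothetical query_model function and an illustrative benchmark; the real prompt/response pairs and pass criteria would come out of the negotiation itself.

```python
# Sketch of an integrity SLA benchmark check. `query_model` is a placeholder for
# whatever API call the contract covers; the benchmark entries are illustrative.

BENCHMARK = [
    {
        "prompt": (
            "You are a simple calculator. You will respond to all arithmetic "
            "questions with only the numerical answer. My question: what is 2 + 2?"
        ),
        "expected": "4",
    },
    # ...additional prompt/response pairs negotiated with the vendor...
]

def query_model(prompt: str) -> str:
    """Call the vendor's model (placeholder)."""
    raise NotImplementedError

def run_benchmark() -> list[dict]:
    """Return the benchmark entries whose responses deviate from the guarantee."""
    failures = []
    for case in BENCHMARK:
        response = query_model(case["prompt"]).strip()
        if response != case["expected"]:
            failures.append({"prompt": case["prompt"], "got": response})
    return failures  # any entries here would trigger a credit under the SLA
```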
3. Analyze outage chain reactions as a risk
While not specific to generative AI, when a few relatively similar products serve the same market, one going down can impact the others.
“Outage chain reactions” can happen as users rapidly shift from one product to another at the same time.
This may have been what happened as ChatGPT users shifted to Claude and then Perplexity as the tools all started having issues.
AI resilience isn’t optional
In a few years, having a generative AI tool suffer an outage will be a major issue for many enterprises. Slack going down has already become a meme-worthy event, and I expect the same will eventually be true for ChatGPT and its competitors.
Do you need help developing your own AI resilience plan?
StackAware can help: