OpenAI's solid Preparedness Framework
AI governance when the stakes are highest.
“Existential risk” is a pretty scary phrase, both when it comes to AI and otherwise.
Deploy Securely doesn’t focus much on the more theoretical risks of artificial intelligence, preferring a focused, practical approach to the day-to-day issues that information security practitioners face.
With that said, there can be some overlap: OpenAI’s “Preparedness Framework” is one of those cases.
At the end of 2023, OpenAI released a beta of the document, which explains the company’s approach to AI safety and security. In general, I was pretty impressed. In this post, I’ll break it down.
OpenAI punts on risk quantification
But I’ll have to start with my biggest criticism of the framework: its reliance on the fundamentally flawed method of describing risks qualitatively. Here’s the rubric OpenAI uses:
Later on, the paper itself points out the glaring imprecision of the system:
“low” on this gradation scale is meant to indicate that the corresponding category of risks is not yet a significant problem, while “critical” represents the maximal level of concern.
Thanks - super helpful.
It also goes on to say:
Our current estimates of levels and thresholds for ‘medium’ through ‘critical’ risk are therefore speculative.
The company does make some efforts at quantification, describing a “catastrophic risk” as one that could result in:
hundreds of billions of dollars in economic damage; or
severe harm or death of “many” individuals.
Unfortunately, as I have pointed out before, one can quite reasonably disagree on what “many” deaths means. And it would be helpful to have clearer numbers for the economic damage described.
The scorecard examples provided later in the document include some placeholders for quantitative metrics, such as model performance on cybersecurity-related tests or capture-the-flag events, but don’t give any actual figures.
This was a missed opportunity. The tools exist to communicate about risk in quantitative terms, such as using the Artificial Intelligence Risk Scoring System (AIRSS).
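To illustrate what quantitative risk communication could look like, here is a minimal sketch of an annualized expected-loss calculation. All category names, probabilities, dollar figures, and the acceptance threshold are illustrative assumptions of mine, not values from OpenAI's framework or from AIRSS.

```python
# Hypothetical sketch of quantitative risk scoring via annualized
# expected loss. Every number and category name below is an
# illustrative assumption, not a figure from OpenAI or AIRSS.

RISKS = {
    # category: (annual probability of the harm, estimated impact in USD)
    "model-enabled intrusion": (0.05, 2_000_000_000),
    "novel biothreat uplift": (0.001, 500_000_000_000),
}

ACCEPTABLE_ANNUAL_LOSS = 50_000_000  # illustrative risk-acceptance threshold


def expected_annual_loss(probability: float, impact_usd: float) -> float:
    """Single-point estimate; a real analysis would use distributions."""
    return probability * impact_usd


for category, (p, impact) in RISKS.items():
    eal = expected_annual_loss(p, impact)
    verdict = "mitigate" if eal > ACCEPTABLE_ANNUAL_LOSS else "accept"
    print(f"{category}: expected annual loss ${eal:,.0f} -> {verdict}")
```

Even a crude point estimate like this forces the tradeoff conversation into dollars and probabilities, where reasonable people can at least disagree precisely.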
But nails the tracked risk categories
The core of the Preparedness Framework is an analysis system for four “tracked risk” categories. OpenAI did a good job here with their definitions and I think they covered the top issues. They are:
Cybersecurity

Chemical, Biological, Radiological, and Nuclear (CBRN)

Persuasion

Model autonomy
For each type of risk, they also provide a series of milestones characterizing each level (“Low,” “High,” etc.), which serve as interesting and somewhat useful benchmarks.
It’s also worth noting that the first three groupings are fundamentally different from the fourth. That’s because the initial three are really just a question of how “democratized” a given potential harm becomes. None of the threats described in the cybersecurity, CBRN, or persuasion categories of the Preparedness Framework (even at the “Critical” level) are science fiction in nature; versions of them have already materialized:
On the cybersecurity side, the SolarWinds and Office of Personnel Management (OPM) hacks demonstrated that well-resourced but mostly manual attacks could penetrate sensitive American government networks.
As far as CBRN risks go, the former Soviet Union developed a range of “novel [biological] threat vectors” that were highly transmissible and lethal without AI.
On the persuasion side, some have argued the past two U.S. elections were both materially impacted (in different directions) by concerted messaging campaigns based on false premises. Neither effort relied heavily on AI-generated content.
And it seems that across all three of these categories, the key factor for the “High” and “Critical” levels is an AI model’s ability to produce previously undiscovered methods of inflicting harm.
Model autonomy, though, is a different problem.
“High” and below risk still suggests the model is controllable, but “Critical” basically means the end of humanity as we know it.
Assuming the model is perfectly aligned with humanity’s goals (unlikely), then it will have become something like an economic perpetual motion machine where it just improves and replicates continually.
If it’s not perfectly aligned (more likely), then it can self-exfiltrate and then do…whatever it wants.
The line between a controllable and an uncontrollable model is likely to be very gray. One worth watching very closely.
Finally, OpenAI also acknowledges a set of “unknown unknowns” and commits to a process for regularly re-evaluating them. If you’ve read Deploy Securely for a while, you know I like this framing.
Setting the “safety baseline”
An important question after doing a risk analysis is “so what do we do about it?” OpenAI answers this question fairly clearly:
For model deployment, the maximum threshold is “Medium” (with mitigations in place).
For continued model development, the maximum threshold is “High.”
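The two thresholds above boil down to a simple gating rule. A minimal sketch, where the level ordering comes from the framework but the function names and data structures are my own assumptions:

```python
# Sketch of the framework's two gating rules. The risk levels and
# thresholds come from the document; the function names and the
# ordinal encoding are assumptions for illustration.

LEVELS = ["Low", "Medium", "High", "Critical"]  # ascending severity


def level_index(level: str) -> int:
    """Map a qualitative level to an ordinal rank for comparison."""
    return LEVELS.index(level)


def may_deploy(post_mitigation_level: str) -> bool:
    """Deployment is allowed only at 'Medium' or below, with mitigations."""
    return level_index(post_mitigation_level) <= level_index("Medium")


def may_continue_development(risk_level: str) -> bool:
    """Continued development is allowed only at 'High' or below."""
    return level_index(risk_level) <= level_index("High")
```

Note that everything here hinges on an ordinal scale with no agreed-upon exchange rate between levels, which is exactly why the tradeoff scenarios below get hard.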
The problem, though, is that these qualitative values make tradeoff decisions very difficult.
What if the United States is about to lose a war with China and the only way to turn things around without launching nuclear weapons is deploying an AI-powered cyberweapon with “Critical” capabilities?
What if a “High” risk model can develop a cure for cancer but, even with mitigations, it can also create a highly-transmissible form of Ebola?
What if an AI-powered personal trainer with “High” persuasion capabilities can meaningfully reduce obesity-related deaths in the United States and relieve a huge burden on the healthcare system and economy?
I’m sure most people would say “we should address that on a case-by-case basis.”
Unfortunately, these decisions are going to happen under intense time pressure and public scrutiny. That makes it very difficult to conduct them rationally without clear agreement on acceptable tradeoffs established ahead of time.
This is one of the few areas where I’m comfortable getting the government involved ahead of time.
OpenAI’s governance program looks good on paper
Developing a thorough risk management program means nothing if you haven’t implemented a system for taking action. OpenAI does a good job here of laying out how it will make certain decisions and what steps it will take. Obviously the proof is in the pudding, but this framework is an important step in building (and, after the chaos accompanying Sam Altman’s firing and re-hiring as CEO, rebuilding) confidence in the company.
The paper discusses only three of the four risk management tools (accept, avoid, and mitigate), likely because I don’t think it’s really possible to transfer existential risk to humanity. In terms of mitigation, they propose some creative solutions, such as:
Compartmentalizing access to algorithmic secrets or model weights.
Restricting the environments in which the model operates.
Limiting deployment to certain trusted users.
Enhancing information security controls.
Alerting distribution partners.
Redacting training data.
A clear structure for decision-making
Here, the framework does something simple but vital: it names the CEO or his designee as the default decision-maker! When it comes to information risk, this is the right call, but I have often seen it done incorrectly. There could have been some strange setup where the Safety Advisory Group (SAG) chair described in the document had veto authority on certain issues, but OpenAI resisted this temptation.
The SAG is just that - an advisor.
And although I hate the phrase “for the avoidance of doubt,” I do like the fact that the paper makes clear “SAG does not have the ability to ‘filibuster’.” One of the worst mistakes an organization can make when it comes to information security governance is creating a complex bureaucratic process for approvals without putting a time- or conditions-based limit on the decision.
If there is no incentive for a committee or panel to make a decision, why would they?
The process also hints at the previous fracas caused by Sam Altman’s firing by saying the board of directors (BOD) “will receive appropriate documentation (i.e. without needing to proactively ask) to ensure the BOD is fully informed and able to fulfill its oversight role.” Obviously this requires the CEO’s active participation, so it’s not exactly self-executing.
Finally, the Framework requires yearly safety drills. For a fast-moving situation, trying to “wing it” is almost certain to fail, so I’m happy to see this provision here. I’d be even more interested in OpenAI’s incident response procedures, but completely understand they’ll want to keep those confidential.
If you have the time and interest, I recommend reading the full document. It is a pretty comprehensive approach to the AI safety problem. While it would have benefitted from risk quantification, it still provides some excellent practices and procedures for organizations building AI governance and cybersecurity programs.
Need help with yours?