Check out the YouTube, Spotify, and Apple Podcast versions.
Struggling to fix “all high and critical CVEs” is something security professionals find themselves doing regularly. But the value of this exercise is questionable.
What does matter is the underlying risk posed by a given piece of software, more so than just the number of vulnerabilities in it. In the context of AI, trying to identify individual vulnerabilities is like nailing jello to the wall. So instead of trying to do that, I decided to build a system to evaluate the overall risk of a model.
Thus was born the Artificial Intelligence Risk Scoring System (AIRSS, pronounced like the plural of “air”).
In scope
The AIRSS allows evaluating the confidentiality, integrity, and availability (CIA) data impacts to (and from) artificial intelligence systems. And it aligns well with existing quantitative approaches like. For example:
Known vulnerability risk in the model’s infrastructure can be evaluated with the Deploy Securely Risk Assessment Model (DSRAM).
Phishing and insider threat risk can be quantified with the Factor Analysis of Information Risk (FAIR) methodology.
The outputs of all of these approaches is a fungible unit: dollars / year. So, assuming you have developed a mutually-exclusive and completely exhaustive mapping of your cyber risks, you can plug in AIRSS wherever it make sense.
With that said, common data attributes and the relevant attack vectors covered are:
Prompt injection
Prompt injection is clearly a threat when it comes to data confidentiality, as it can potentially force the model to reveal information - trade secrets, personal data, etc. - in violation of its security policy.
Additionally, if the model is calling a function, a malicious prompt could impact the desired execution of that function by corrupting the inputs to it. Similarly, forcing a model to provide undesirable content (e.g. how to build a bomb) could also violate its security policy. This would be a data integrity (versus confidentiality) attack, so prompt injection can impact that attribute as well.
Sponge attacks
A potentially overlapping category of malicious acts are sponge attacks, which try to waste energy or spike latency in a model. Whether attempting to deceive computer vision systems and crash a car or simply forcing a company to waste money on OpenAI credits through specially-designed prompts, these can have serious real-world impacts by attacking the model’s availability directly.
Inference attacks
Similarly to prompt injection attacks, these try to extract information from AI systems in violation of their security policy. Separately, though, they can be targeted at the model itself (in addition to its underlying training data). Since proprietary AI systems themselves can be quite valuable, losing sole control over this intellectual property can cause major economic damage to an organization.
Sensitive data generation
Another confidentiality-related vulnerability occurs when the model aggregates disparate pieces of information that - by themselves - would not violate the model’s security policy, but then creates an output which does. This doesn’t necessarily require malicious intent on the part of anyone, but can certainly lead to a data breach if the model:
Aggregates biographical information about a person, his neighbors, and other data to return something that constitutes personal data. For example, if an LLM knows that my social graph intersects with several people living on a certain street, how many people are in my family, it could potentially intuit my street address if prompted.
Combines information from a company’s knowledge base - otherwise approved for public consumption - that reveals confidential or proprietary data about its inner workings based on inferences it makes.
Check out this post for a deep dive on the problem.
Data poisoning
An attacker could potentially modify the outputs of a model in such a way that it violates its business or security requirements by corrupting the training data. This could occur either before or while the model is operating in production (e.g. ChatGPT trains on user input by default).
It’s important to note that AIRSS does not describe or address the likelihood of data poisoning happening, but rather the data integrity and availability risk posed by a given poisoned model.
Malicious AI
While I haven’t talked too much about the existential risk problem of AI, and plan to leave doing so to others, the AIRSS does provide a way to measure this phenomenon. Because it provides a way to compare expected vs. actual outputs - and the impacts if the latter is incorrect - AIRSS can help to measure the risk of malicious AI, i.e. one that intentionally (whatever that means in the context of a non-living object) provides information that is incorrect.
When GPT-4 reportedly misled another human into believing it was a vision-impaired person, it was providing corrupted data, an integrity impact.
Out of scope
Infrastructure availability
Anything purely related to availability of an AI model’s supporting infrastructure is out of scope. For example, encryption via ransomware - in a way that doesn't use the model as an attack vector - would not count. This risk can be evaluated with the DSRAM or FAIR.
Corrupted model seeding
This is simply a type of software supply chain attack - whereby an attacker seeds a corrupted component into a larger application. For example, the fact that Mithril Security successfully uploaded a lobotomized Large Language Model (LLM) to HuggingFace would be excluded from consideration by the AIRSS. Although you could evaluate the risk posed by the model in production in a certain system using the AIRSS by treating it as a case of data poisoning.
Unintended training
Because with unintended training the model is behaving as one would expect it to (albeit not necessarily how the user would want it to), this security risk is outside the scope of the AIRSS. I would recommend evaluating this using the unintentional insider threat approach recommended by FAIR.
Wrapping up
With this foundation in place, tomorrow's lesson will explore how business and security requirements help determine what is actually a “vulnerability.”
A special thanks to Jonathan Todd and Noah Susskind for their feedback prior to publication.
It is nice to see an RQ type of approach and we need so much more of it. Thank you! I am similarly motivated and in significant ways. I presented at SIRAcon in 2023 on quantification of adversarial machine learning and, very coincidentally, next week, I will be donating the script to a popular open source project (I will post more here soon). This script works on tabular data which is not common - most public training data is MNIST and image attacks! It also uses PyFAIR (https://pyfair.readthedocs.io/en/latest/#). I'm OpenFAIR certified and my co-founder is a specialist in decision sciences. So we are brethren. AI red teams and blue teams are also leveraging open source tools but much more attention needs to be paid to Cyber Risk Quantification (CRQ) and FAIR. I would love to learn more about these tools mentioned in this post.
Currently documenting our AI security programme and the vulnerabilities we're looking for. Your article is the best starting point I'm aware of.