Just describing what type of AI training is occurring can be difficult without detailed knowledge of a company’s operations. But for risk management purposes, you need clear ways of describing what is going on.
This helps you:
Mitigate unintended training and sensitive data generation risk
Reassure your customers about 4th party AI processing
Comply with applicable laws and regulations
So here are the six ways StackAware describes AI training:
First, choose from one of two types of AI.
1. Predictive AI
A system, model, or algorithm whose outputs are only numbers, categorical labels, or combinations thereof, like:
Linear regression
Logistic regression
k-means clustering
2. Generative AI
Same as above, but whose outputs include anything more than numbers or categorical labels, like:
Large language models
Image generation models
Next, choose the type of data trained on:
Check out our post on data classification for details, but to summarize, there are three types of information we consider:
1. Public data
Anything that can be posted on, or accessed from, the open internet without authentication.
This includes things like:
Inbound application programming interface (API) calls or clicks
Publicly-available data sets
Content on web pages
2. Confidential data
Information that an organization either:
Is under an obligation of confidentiality to protect
Simply doesn't want released (trade secrets, etc.)
3. (Pseudo)anonymized data
This is where things get grayer, but it includes:
Usage statistics where the user or organization is not identified
Images that cannot identify a person (faces blurred, etc.)
Other data modified to obscure the owner
We also make these assumptions:
All telemetry, service, and usage data from logged-in users is confidential.
If a user is not logged in, the data is public.
A vendor is doing whatever they assert the right to do in their:
Responsible AI statement/policy
Privacy statement/policy
Terms & conditions
Trust center
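These defaults are simple enough to encode directly. A minimal sketch in Python (the function name is our own illustration, not from any library):

```python
def classify_usage_data(logged_in: bool) -> str:
    """Apply the default assumption above: telemetry, service, and usage
    data from logged-in users is confidential; otherwise it is public."""
    return "confidential" if logged_in else "public"
```

So `classify_usage_data(True)` returns `"confidential"`, and `classify_usage_data(False)` returns `"public"`.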
Across these two dimensions, you get six types of AI training:
predictive-ai-training-public-data
generative-ai-training-public-data
predictive-ai-training-confidential-data
generative-ai-training-confidential-data
predictive-ai-training-anonymized-confidential-data
generative-ai-training-anonymized-confidential-data
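The six labels are just the cross product of the two dimensions. A quick Python sketch to generate them (the names are illustrative):

```python
from itertools import product

AI_TYPES = ("predictive", "generative")
DATA_TYPES = ("public-data", "confidential-data", "anonymized-confidential-data")

def training_labels() -> list[str]:
    """Cross the two AI types with the three data types to produce
    all six AI training labels, in the order listed above."""
    return [f"{ai}-ai-training-{data}" for data, ai in product(DATA_TYPES, AI_TYPES)]
```

Calling `training_labels()` yields the six strings above, starting with `predictive-ai-training-public-data`.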
And if you are really advanced, you can use these as formulations in the CycloneDX Software Bill of Materials (SBOM) standard, like we do.
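For illustration only, here is a sketch of how one of these labels could be embedded as a property on a CycloneDX formulation entry. This assumes CycloneDX 1.5 or later (which introduced the formulation element); the `bom-ref` and property name are hypothetical, so check the CycloneDX specification for the exact schema:

```python
import json

# Hedged sketch: assumes CycloneDX 1.5+ "formulation"; the bom-ref and
# property name below are hypothetical placeholders, not spec-defined values.
sbom_fragment = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "formulation": [
        {
            "bom-ref": "formula-1",  # hypothetical identifier
            "properties": [
                {
                    "name": "stackaware:ai-training-type",  # hypothetical name
                    "value": "generative-ai-training-confidential-data",
                }
            ],
        }
    ],
}

print(json.dumps(sbom_fragment, indent=2))
```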
This also integrates with the Artificial Intelligence Risk Scoring System (AIRSS) and lets you completely describe a given system or application.
Need help understanding AI training?
As part of our risk assessment and governance offering, we'll map this throughout your entire supply chain. And give you actionable recommendations to manage risk.
So if you are a security or compliance leader in:
Financial services
Healthcare
B2B SaaS