Just describing what type of AI training is occurring can be difficult without detailed knowledge of a company’s operations. But for risk management purposes, you need clear ways of describing what is going on.
This helps you:
- Mitigate unintended training and sensitive data generation risk 
- Reassure your customers about 4th party AI processing 
- Comply with applicable laws and regulations 
So here are the six ways StackAware describes AI training:
First, choose one of two types of AI.
1. Predictive AI
A system, model, or algorithm whose outputs are only numbers, categorical labels, or combinations thereof, like:
- Linear regression 
- Logistic regression 
- k-means clustering 
2. Generative AI
A system, model, or algorithm that outputs anything beyond numbers or categorical labels, like:
- Large language models 
- Image generation models 
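To make the split concrete, here is a minimal sketch that sorts a system into one of the two buckets based on what it outputs. The `OutputKind` enum and `classify_ai_type` helper are hypothetical names chosen for illustration, not part of any StackAware tooling.

```python
from enum import Enum


class OutputKind(Enum):
    """Hypothetical categories of model output, for illustration only."""
    NUMBER = "number"
    CATEGORICAL_LABEL = "categorical-label"
    TEXT = "text"
    IMAGE = "image"


# Output kinds that keep a system in the predictive bucket.
PREDICTIVE_KINDS = {OutputKind.NUMBER, OutputKind.CATEGORICAL_LABEL}


def classify_ai_type(output_kinds: set[OutputKind]) -> str:
    """Return 'predictive-ai' if every output is a number or label, else 'generative-ai'."""
    if not output_kinds:
        raise ValueError("at least one output kind is required")
    # Subset check: anything beyond numbers and labels makes the system generative.
    return "predictive-ai" if output_kinds <= PREDICTIVE_KINDS else "generative-ai"


print(classify_ai_type({OutputKind.NUMBER, OutputKind.CATEGORICAL_LABEL}))  # predictive-ai
print(classify_ai_type({OutputKind.NUMBER, OutputKind.TEXT}))               # generative-ai
```

Under this sketch, a single output kind beyond numbers or labels is enough to make the system generative, which mirrors the definitions above.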
Next, choose the type of data trained on.
Check out our post on data classification for details, but to summarize, there are three types of information we consider:
1. Public data
Anything that can be posted on - or accessed from - the open internet without authentication.
This includes things like:
- Inbound application programming interface (API) calls or clicks 
- Publicly-available data sets 
- Content on web pages 
2. Confidential data
Information for which an organization is either:
- Under an obligation of confidentiality to protect 
- Or just doesn't want released (trade secrets, etc.) 
3. (Pseudo)anonymized data
This is where things get grayer, but it includes:
- Usage statistics where the user or organization is not identified 
- Images that cannot identify a person (faces blurred, etc.) 
- Other data modified to obscure the owner 
We also make these assumptions:
- All telemetry, service, and usage data from logged-in users is confidential.
- If a user is not logged in, the data is public.
- A vendor is doing whatever they assert the right to do in their:
  - Responsible AI statement/policy
  - Privacy statement/policy
  - Terms & conditions
  - Trust center
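The first two assumptions can be read as a simple decision rule. The sketch below is one way to express it; the `classify_data` function and its `anonymized` flag are hypothetical, and treating anonymization as something applied only to otherwise-confidential data is our reading of the labels, not a stated rule.

```python
def classify_data(logged_in: bool, anonymized: bool = False) -> str:
    """Classify training data using the logged-in assumption described above.

    Assumption (ours, for illustration): anonymization only changes the label
    for data that would otherwise be confidential.
    """
    if not logged_in:
        # Data from users who are not logged in is treated as public.
        return "public-data"
    if anonymized:
        # Confidential data modified to obscure the owner.
        return "anonymized-confidential-data"
    # Telemetry, service, and usage data from logged-in users.
    return "confidential-data"


print(classify_data(logged_in=False))                  # public-data
print(classify_data(logged_in=True))                   # confidential-data
print(classify_data(logged_in=True, anonymized=True))  # anonymized-confidential-data
```

In practice, whether a vendor actually anonymizes data has to come from the documents listed above, so the `anonymized` flag would be set by a human reviewer rather than inferred.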
 
Across these two dimensions, you get six types of AI training:
- predictive-ai-training-public-data 
- generative-ai-training-public-data 
- predictive-ai-training-confidential-data 
- generative-ai-training-confidential-data 
- predictive-ai-training-anonymized-confidential-data 
- generative-ai-training-anonymized-confidential-data 
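The six labels are just the cross product of the two dimensions. As a small sketch (the constant and function names are hypothetical), they can be generated rather than typed by hand:

```python
from itertools import product

# The two dimensions described above.
AI_TYPES = ("predictive-ai", "generative-ai")
DATA_TYPES = ("public-data", "confidential-data", "anonymized-confidential-data")


def training_labels() -> list[str]:
    """Build every '<ai-type>-training-<data-type>' label, grouped by data type."""
    return [f"{ai}-training-{data}" for data, ai in product(DATA_TYPES, AI_TYPES)]


for label in training_labels():
    print(label)
```

Generating the labels from two short tuples keeps the naming consistent if you later tag vendors or systems with them in an inventory.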
And if you are really advanced, you can use these as formulations in the CycloneDX Software Bill of Materials (SBOM) standard, like we do.
This also integrates with the Artificial Intelligence Risk Scoring System (AIRSS) and lets you completely describe a given system or application.
Need help understanding AI training?
As part of our risk assessment and governance offering, we’ll map this throughout your entire supply chain and give you actionable recommendations to manage risk.
So if you are a security or compliance leader in:
- Financial services 
- Healthcare 
- B2B SaaS 


