Just describing what type of AI training is occurring can be difficult without detailed knowledge of a company’s operations. But for risk management purposes, you need clear ways of describing what is going on.
This helps you:
Mitigate unintended training and sensitive data generation risk
Reassure your customers about 4th party AI processing
Comply with applicable laws and regulations
So here are the six ways StackAware describes AI training:
First, choose from one of two types of AI.
1. Predictive AI
A system, model, or algorithm whose outputs are only numbers, categorical labels, or combinations thereof, like:
Linear regression
Logistic regression
k-means clustering
2. Generative AI
Same as above, but whose outputs include anything more than numbers or categorical labels, like:
Large language models
Image generation models
Next, choose the type of data trained on:
Check out our post on data classification for details, but to summarize, there are three types of information we consider:
1. Public data
Anything that can be posted on, or accessed from, the open internet without authentication.
This includes things like:
Inbound application programming interface (API) calls or clicks
Publicly-available data sets
Content on web pages
2. Confidential data
Information that an organization either:
Is under an obligation of confidentiality to protect
Simply doesn't want released (trade secrets, etc.)
3. (Pseudo)anonymized data
This is where things get grayer, but it includes:
Usage statistics where the user or organization is not identified
Images that cannot identify a person (faces blurred, etc.)
Other data modified to obscure the owner
We also make these assumptions:
All telemetry, service, and usage data from logged-in users is confidential.
If a user is not logged in, the data is public.
A vendor is doing whatever they assert the right to do in their:
Responsible AI statement/policy
Privacy statement/policy
Terms & conditions
Trust center
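These defaults are simple enough to encode directly. A minimal sketch in Python (the function name is our own illustration, not from any library):

```python
def classify_usage_data(logged_in: bool) -> str:
    """Apply the default assumption above: telemetry, service, and usage
    data from logged-in users is confidential; otherwise it is public."""
    return "confidential" if logged_in else "public"
```

So `classify_usage_data(True)` returns `"confidential"`, and `classify_usage_data(False)` returns `"public"`.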
Across these two dimensions, you get six types of AI training:
predictive-ai-training-public-data
generative-ai-training-public-data
predictive-ai-training-confidential-data
generative-ai-training-confidential-data
predictive-ai-training-anonymized-confidential-data
generative-ai-training-anonymized-confidential-data
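The six labels are just the cross product of the two dimensions. A quick Python sketch to generate them (the names are illustrative):

```python
from itertools import product

AI_TYPES = ("predictive", "generative")
DATA_TYPES = ("public-data", "confidential-data", "anonymized-confidential-data")

def training_labels() -> list[str]:
    """Cross the two AI types with the three data types to produce
    all six AI training labels, in the order listed above."""
    return [f"{ai}-ai-training-{data}" for data, ai in product(DATA_TYPES, AI_TYPES)]
```

Calling `training_labels()` yields the six strings above, starting with `predictive-ai-training-public-data`.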
And if you are really advanced, you can use these as formulations in the CycloneDX Software Bill of Materials (SBOM) standard, like we do.
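For illustration only, here is a sketch of how one of these labels could be embedded as a property on a CycloneDX formulation entry. This assumes CycloneDX 1.5 or later (which introduced the formulation element); the `bom-ref` and property name are hypothetical, so check the CycloneDX specification for the exact schema:

```python
import json

# Hedged sketch: assumes CycloneDX 1.5+ "formulation"; the bom-ref and
# property name below are hypothetical placeholders, not spec-defined values.
sbom_fragment = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "formulation": [
        {
            "bom-ref": "formula-1",  # hypothetical identifier
            "properties": [
                {
                    "name": "stackaware:ai-training-type",  # hypothetical name
                    "value": "generative-ai-training-confidential-data",
                }
            ],
        }
    ],
}

print(json.dumps(sbom_fragment, indent=2))
```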
This also integrates with the Artificial Intelligence Risk Scoring System (AIRSS) and lets you completely describe a given system or application.
Need help understanding AI training?
As part of our risk assessment and governance offering, we'll map this throughout your entire supply chain. And give you actionable recommendations to manage risk.
So if you are a security or compliance leader in:
Financial services
Healthcare
B2B SaaS