The only actual way to achieve AI governance

How to track data classification, categorization, use, quality, bias, and retention without going crazy

Apr 07, 2024

A consistent and disciplined approach to managing your data is what you need.

As we prepare for ISO 42001 certification, we've found that a key requirement of the standard is tracking your company’s data.

Funny enough, the process is sufficiently painful that it incentivizes you to simply delete unneeded data rather than labeling it. An unexpected cybersecurity benefit!

For that which remained, here is how we did it at StackAware:

Data classification

We use five categories:

Public
Public-Personal Data
Confidential-Personal Data
Confidential-Internal
Confidential-External

I wrote a whole post on this, so check it out for details.

Additionally, and unless affirmatively determined otherwise following special processing or de-identification procedures, AI-generated data retains the classification of its training data. Thus, if a model is trained on Confidential-Internal data, it is only made available to those under an obligation of confidentiality to the company.
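
To make the inheritance rule concrete, here is a minimal Python sketch. The names `Classification`, `output_classification`, and `needs_confidentiality_obligation` are hypothetical, and it assumes each model is trained on data carrying a single classification and that every "Confidential" label requires a confidentiality obligation. Treat it as an illustration, not a description of any production system.

```python
from enum import Enum
from typing import Optional


class Classification(Enum):
    # The five classifications listed above.
    PUBLIC = "Public"
    PUBLIC_PERSONAL = "Public-Personal Data"
    CONFIDENTIAL_PERSONAL = "Confidential-Personal Data"
    CONFIDENTIAL_INTERNAL = "Confidential-Internal"
    CONFIDENTIAL_EXTERNAL = "Confidential-External"


def output_classification(
    training: Classification,
    reviewed_as: Optional[Classification] = None,
) -> Classification:
    """Return the label for AI-generated data.

    By default the output inherits the training data's classification.
    `reviewed_as` stands in for an affirmative determination made after
    special processing or de-identification (hypothetical parameter).
    """
    return reviewed_as if reviewed_as is not None else training


def needs_confidentiality_obligation(label: Classification) -> bool:
    """Assumption: any 'Confidential' label restricts access."""
    return label.value.startswith("Confidential")
```

For example, `output_classification(Classification.CONFIDENTIAL_INTERNAL)` returns `Classification.CONFIDENTIAL_INTERNAL`, so the output stays restricted until someone explicitly reclassifies it.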

Data category

An entry can fit in one or more of these:

AI-training: Self-explanatory. The largest portion of the dataset which drives the development of the model.
AI-validation: used to tune hyperparameters and mitigate overfitting risk.
AI-testing: used to evaluate the fully-trained model. This is real data but not used for training.
AI-processing: information affirmatively approved for use with AI systems. Restrictions on merely using (not training on) customer data in AI systems continue to crop up in enterprise agreements. While optimally you would negotiate these away, if you must agree to them, you’ll need a way to tag data approved for AI use (implicitly rejecting all other types for processing).
AI-generated: information wholly created by an AI system (i.e. excluding anything where a human controlled the final output, such as with autocomplete in a word processor), and which includes all the biases of the underlying system (so we also use this as a bias categorization). We don’t train on this, to avoid model collapse.
Diagnostic-testing: separate from AI-testing, this is usually fake or example data used to evaluate business logic. Neither it nor its output drives business operations.
Production: real-world data processed by AI or other systems and on which we make decisions.
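
Because an entry can carry several of these tags at once, a set of tags is a natural fit. The sketch below is a hypothetical illustration in Python: the names `Category`, `can_train_on`, and `can_process_with_ai` are made up for this example, and requiring an explicit AI-training tag before training is an assumption on top of the rules described above.

```python
from enum import Enum


class Category(Enum):
    # The seven data categories listed above.
    AI_TRAINING = "AI-training"
    AI_VALIDATION = "AI-validation"
    AI_TESTING = "AI-testing"
    AI_PROCESSING = "AI-processing"
    AI_GENERATED = "AI-generated"
    DIAGNOSTIC_TESTING = "Diagnostic-testing"
    PRODUCTION = "Production"


def can_train_on(categories: set[Category]) -> bool:
    """Allow training only on entries tagged AI-training (assumption)
    that are not AI-generated, to avoid model collapse."""
    return (
        Category.AI_TRAINING in categories
        and Category.AI_GENERATED not in categories
    )


def can_process_with_ai(categories: set[Category]) -> bool:
    """Only entries explicitly tagged AI-processing may be sent to AI
    systems; everything else is implicitly rejected."""
    return Category.AI_PROCESSING in categories
```

The deny-by-default check in `can_process_with_ai` mirrors the contractual situation: anything without the tag is treated as off-limits.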

Data intended use

This is likely to evolve over time, and more categories will accumulate, but for now we use:

Human resources
Admin and finance
Sales and marketing
Product development
Security and compliance

Data quality

If you’ve followed me for a bit, you know I am not a fan of qualitative categorizations. Unfortunately, I don't know of any better way to do this than:

High: straight from a known source, no reason to doubt.
Medium: unclear source, but looks okay.
Low: evidence of inaccuracy.

Data bias

The bias-variance tradeoff in artificial intelligence and machine learning is unavoidable.

Same goes with the training data. To help illuminate this, I’m using these categories:

Immaterial bias: this is stuff like technical standards and representations. Since people created these things, they're biased. But it's hard to see how it would substantially impact a model trained on or processing this data.
Anecdotal: the opinions or experiences of a few or even just one person; not necessarily representative of reality or the norm.
Unrepresentative sample: this of course will depend on the eye of the beholder, but a pretty obvious example would be if I am training an LLM-powered chatbot on Slack messages between security team members during a data breach. You would probably get more terse responses than desired!
Conflict of interest: Whenever a person or organization is providing information about itself (or an interested partner), there is an inherent conflict of interest due to the desire to put the best foot forward and downplay negative information.
AI-generated: see above.

Data retention

At least in our back-end, if it’s Personal Data and we haven’t interacted with the data subject for 6 years, the record gets wiped. We have different retention policies and mechanisms for other systems, but this is the standard.

Obviously that would change in case of legal hold, but I think that balances business, security, and privacy needs nicely.
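
As a rough sketch of that rule in Python: the function and parameter names below are hypothetical, and treating "exactly six years ago today" as not yet expired is an assumption about the boundary.

```python
from datetime import date

RETENTION_YEARS = 6  # from the policy described above


def years_before(day: date, years: int) -> date:
    """Return the same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def should_wipe(
    is_personal_data: bool,
    last_interaction: date,
    today: date,
    legal_hold: bool = False,
) -> bool:
    """True if a record is due for deletion under the retention rule."""
    if not is_personal_data or legal_hold:
        return False
    # Wipe only when the last interaction is strictly older than the cutoff.
    return last_interaction < years_before(today, RETENTION_YEARS)
```

A scheduled job could run a check like this against each record, with the legal hold flag overriding everything else.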

AI governance is data governance

By using table references to link back to these definitions, we can ensure consistent labeling and handling throughout our stack.
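
One way to picture table references is a lookup table plus a foreign key. The sketch below uses Python's built-in `sqlite3` module with the data quality definitions from above; the table and column names (`data_quality`, `records`) and the helper functions are hypothetical, and a real back-end may look quite different.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with a definitions table and a records table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE data_quality (name TEXT PRIMARY KEY, definition TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO data_quality VALUES (?, ?)",
        [
            ("High", "straight from a known source, no reason to doubt"),
            ("Medium", "unclear source, but looks okay"),
            ("Low", "evidence of inaccuracy"),
        ],
    )
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER PRIMARY KEY, "
        "description TEXT NOT NULL, "
        "quality TEXT NOT NULL REFERENCES data_quality(name))"
    )
    return conn


def add_record(conn: sqlite3.Connection, description: str, quality: str) -> bool:
    """Insert a record; return False if the quality label is not a defined value."""
    try:
        conn.execute(
            "INSERT INTO records (description, quality) VALUES (?, ?)",
            (description, quality),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, `add_record(conn, "vendor questionnaire", "High")` succeeds, while a free-text label such as "Pretty good" is rejected by the database itself rather than by convention.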

As your AI management system gets more complex, ISO 42001 in particular and governance in general can become a beast.

Need help prepping for your audit?

Check out the AIMS accelerator
