The only actual way to achieve AI governance
How to track data classification, categorization, use, quality, bias, and retention without going crazy
A consistent and disciplined approach to managing your data is what you need.
We are preparing for ISO 42001 certification, and a key requirement of the standard is tracking your company’s data.
Funnily enough, the process is painful enough that it incentivizes you to simply delete unneeded data rather than label it. An unexpected cybersecurity benefit!
For the data that remained, here is how we did it at StackAware:
Data classification
We use six categories:
Public
Public-Personal Data
Confidential-Personal Data
Confidential-Internal
Confidential-External
Restricted
I wrote a whole post on this, so check it out for details.
Additionally, unless we affirmatively determine otherwise after special processing or de-identification, AI-generated data retains the classification of the data the model was trained on. Thus, if a model is trained on Confidential-Internal data, its output is only made available to those under an obligation of confidentiality to the company.
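As a rough illustration of that inheritance rule, here is a minimal Python sketch. The names Model, classify_output, and reviewed_as are hypothetical and only stand in for however you record a model's training data and any affirmative re-classification; this is not a description of our actual tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Classification(Enum):
    PUBLIC = "Public"
    PUBLIC_PERSONAL = "Public-Personal Data"
    CONFIDENTIAL_PERSONAL = "Confidential-Personal Data"
    CONFIDENTIAL_INTERNAL = "Confidential-Internal"
    CONFIDENTIAL_EXTERNAL = "Confidential-External"
    RESTRICTED = "Restricted"


@dataclass(frozen=True)
class Model:
    # Hypothetical record: assumes one classification per training set.
    name: str
    training_classification: Classification


def classify_output(
    model: Model, reviewed_as: Optional[Classification] = None
) -> Classification:
    """Return the classification for data generated by `model`.

    By default the output inherits the training data's classification.
    `reviewed_as` represents an affirmative determination made after
    special processing or de-identification (an assumed mechanism).
    """
    if reviewed_as is not None:
        return reviewed_as
    return model.training_classification
```

The point of the design is that inheritance is the default and any downgrade has to be passed in explicitly, so nobody re-labels AI output by accident.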
Data category
An entry can fit in one or more of these:
AI-training: self-explanatory. The largest portion of the dataset, which drives the development of the model.
AI-validation: used to tune hyperparameters and mitigate overfitting risk.
AI-testing: used to evaluate the fully-trained model. This is real data that is not used for training.
AI-generated: This is information wholly created by an AI system (i.e. excluding anything where a human controlled the final output, such as with autocomplete in a word processor), and which includes all the biases of the underlying system (so we also use this as a bias categorization). We don’t train on this, to avoid model collapse.
Diagnostic-testing: separate from AI-testing, this is usually fake or example data used to evaluate business logic. Neither it nor its output drives business operations.
Production: real-world data processed by AI or other systems and on which we make decisions.
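Because an entry can carry more than one category, a bit-flag type is one natural way to model this list. The sketch below is an assumption about how you might encode it in Python, with a hypothetical check_categories helper that flags the two combinations the definitions above rule out (training on AI-generated data, and testing on data that was also used for training).

```python
from enum import Flag, auto
from typing import List


class Category(Flag):
    AI_TRAINING = auto()
    AI_VALIDATION = auto()
    AI_TESTING = auto()
    AI_GENERATED = auto()
    DIAGNOSTIC_TESTING = auto()
    PRODUCTION = auto()


def check_categories(categories: Category) -> List[str]:
    """Return a list of rule violations for a combination of categories."""
    problems = []
    # We don't train on AI-generated data, to avoid model collapse.
    if Category.AI_GENERATED in categories and Category.AI_TRAINING in categories:
        problems.append("AI-generated data must not be used for training")
    # AI-testing data is real data that is not used for training.
    if Category.AI_TESTING in categories and Category.AI_TRAINING in categories:
        problems.append("AI-testing data must not also be training data")
    return problems
```

For example, check_categories(Category.AI_TESTING | Category.PRODUCTION) returns an empty list, while combining Category.AI_TRAINING with Category.AI_GENERATED returns one violation.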
Data intended use
This is likely to evolve over time, and more categories will accumulate, but for now we use:
Human resources
Admin and finance
Sales and marketing
Product development
Security and compliance
Data quality
If you’ve followed me for a bit, you know I am not a fan of qualitative categorizations. Unfortunately, I don't know of any better way to do this than:
High: straight from a known source, no reason to doubt.
Medium: unclear source, but looks okay.
Low: evidence of inaccuracy.
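Even a qualitative scale is easier to apply consistently when the decision rule is written down. Here is a minimal sketch of the three levels as a function; the two boolean inputs and the choice to let evidence of inaccuracy override a known source are my assumptions for illustration.

```python
from enum import Enum


class Quality(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


def rate_quality(source_known: bool, evidence_of_inaccuracy: bool) -> Quality:
    """Map two yes/no observations onto the High/Medium/Low scale."""
    # Assumption: evidence of inaccuracy wins, even for a known source.
    if evidence_of_inaccuracy:
        return Quality.LOW
    # Known source and no reason to doubt it.
    if source_known:
        return Quality.HIGH
    # Unclear source, but it looks okay.
    return Quality.MEDIUM
```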
Data bias
The bias-variance tradeoff in artificial intelligence and machine learning is unavoidable.
The same goes for bias in the training data. To help illuminate this, I’m using these categories:
Immaterial bias: this is stuff like technical standards and representations. Since people created these things, they're biased. But it's hard to see how it would substantially impact a model trained on or processing this data.
Anecdotal: the opinions or experiences of a few or even just one person; not necessarily representative of reality or the norm.
Unrepresentative sample: this of course will depend on the eye of the beholder, but a pretty obvious example would be training an LLM-powered chatbot on Slack messages between security team members during a data breach. You would probably get more terse responses than desired!
Conflict of interest: Whenever a person or organization is providing information about itself (or an interested partner), there is an inherent conflict of interest due to the desire to put the best foot forward and downplay negative information.
AI-generated: see above.
Data retention
At least in our back-end, if it’s Personal Data and we haven’t interacted with the data subject for 6 years, the record gets wiped. We have different retention policies and mechanisms for other systems, but this is the standard.
Obviously that would change in case of a legal hold, but I think the six-year standard balances business, security, and privacy needs nicely.
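The retention rule is simple enough to express as a single check. This is a sketch under stated assumptions, not our production job: the function name, its parameters, and the decision to treat "exactly six years ago" as expired are all hypothetical.

```python
from datetime import date

RETENTION_YEARS = 6


def should_wipe(
    is_personal_data: bool,
    last_interaction: date,
    today: date,
    legal_hold: bool = False,
) -> bool:
    """Return True if a record is due to be wiped under the 6-year rule."""
    # This rule only covers Personal Data, and a legal hold always wins.
    if not is_personal_data or legal_hold:
        return False
    try:
        cutoff = today.replace(year=today.year - RETENTION_YEARS)
    except ValueError:
        # today is Feb 29 and the cutoff year is not a leap year
        cutoff = today.replace(year=today.year - RETENTION_YEARS, day=28)
    # Assumption: an interaction exactly six years ago counts as expired.
    return last_interaction <= cutoff
```

Passing today in explicitly, rather than calling date.today() inside the function, keeps the check easy to test and to re-run for an audit as of a given date.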
AI governance is data governance
By using table references to link back to these definitions, we can ensure consistent labeling and handling throughout our stack.
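To make the idea of table references concrete, here is a small sketch using SQLite from Python's standard library. The schema, table names, and add_asset helper are hypothetical; the same pattern applies to whatever database or tool you actually keep your inventory in. Each label lives in one lookup table, and every inventory row points at it, so a label that has not been defined cannot be attached to anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE classification (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE quality (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE data_asset (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    classification_id INTEGER NOT NULL REFERENCES classification(id),
    quality_id INTEGER NOT NULL REFERENCES quality(id)
);
""")

CLASSIFICATIONS = [
    "Public", "Public-Personal Data", "Confidential-Personal Data",
    "Confidential-Internal", "Confidential-External", "Restricted",
]
QUALITIES = ["High", "Medium", "Low"]
conn.executemany("INSERT INTO classification (name) VALUES (?)",
                 [(n,) for n in CLASSIFICATIONS])
conn.executemany("INSERT INTO quality (name) VALUES (?)",
                 [(n,) for n in QUALITIES])


def add_asset(name: str, classification: str, quality: str) -> None:
    """Insert an asset; raises sqlite3.IntegrityError for an undefined label."""
    # An unknown label makes the subquery return NULL, which violates NOT NULL.
    conn.execute(
        "INSERT INTO data_asset (name, classification_id, quality_id) "
        "VALUES (?, (SELECT id FROM classification WHERE name = ?), "
        "(SELECT id FROM quality WHERE name = ?))",
        (name, classification, quality),
    )
```

Calling add_asset("crm_contacts", "Confidential-Personal Data", "High") succeeds, while a made-up label such as "Top Secret" is rejected by the database rather than silently stored.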
As your AI management system gets more complex, ISO 42001 in particular and governance in general can become a beast.
Need help prepping for your audit?