Data classification mastery: 6 simple categories for handling sensitive information
Start AI governance the right way with an actionable data governance system.
Check out this whiteboard session where I go through this approach.
AI makes companies even hungrier for data.
This makes cybersecurity and privacy even more important.
A key component of any effective AI governance program is a data classification policy and set of procedures. If you can’t identify what your data means and how sensitive it is, you will have a tough time protecting it.
Most organizations use ascending levels of data sensitivity, but shouldn’t
Unfortunately, most of the systems I have seen leave something to be desired. For example, I often see ascending levels of sensitivity like:
Public
Internal
Restricted
Confidential
Each level is increasingly valuable to the organization, and thus many companies create this graduated series of data classifications. I also see other guidance implicitly create additional categories by referring to “sensitive data,” even though the term is never defined in the data classification policy.
Unfortunately, having this ascending (and somewhat unclear) structure is usually unnecessary and counterproductive.
That is because organizations rarely create separate systems or handling processes for these different classifications.
How StackAware labels data
We use the following top-level classifications:
Public
Public-Personal Data
Confidential-Personal Data
Confidential-Internal
Confidential-External
Restricted
These are different not in terms of the value of the data they describe but rather in terms of the handling procedures for the information. Additionally, these terms can be further sub-divided into more granular categories.
“Restricted” is a catch-all term that covers the three Confidential categories.
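One way to make these labels machine-readable is a small enumeration. The sketch below is a hypothetical illustration, not StackAware's actual tooling: the names `DataLabel`, `RESTRICTED`, and `is_restricted` are assumptions, and Restricted is modeled as a grouping of the three Confidential labels rather than as a separate value.

```python
from enum import Enum


class DataLabel(Enum):
    """Top-level labels from the list above (Restricted is handled as a group)."""
    PUBLIC = "Public"
    PUBLIC_PERSONAL = "Public-Personal Data"
    CONFIDENTIAL_PERSONAL = "Confidential-Personal Data"
    CONFIDENTIAL_INTERNAL = "Confidential-Internal"
    CONFIDENTIAL_EXTERNAL = "Confidential-External"


# Assumption: "Restricted" is the collective term for the three Confidential labels.
RESTRICTED = frozenset({
    DataLabel.CONFIDENTIAL_PERSONAL,
    DataLabel.CONFIDENTIAL_INTERNAL,
    DataLabel.CONFIDENTIAL_EXTERNAL,
})


def is_restricted(label: DataLabel) -> bool:
    """Return True when the label belongs to the Restricted group."""
    return label in RESTRICTED
```

Keeping Restricted as a set rather than a label means a record always carries exactly one specific label, while policies can still refer to the whole group at once.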
Public
Self-explanatory. This is anything that can be posted on the open internet without restriction. At StackAware, we take the additional step of affirmatively trying to publish anything classified as “Public.” That’s because, if there is no risk in the information getting out there, it might as well serve as marketing collateral.
Building in public is part of our competitive advantage.
Public-Personal Data
I am not an attorney and this is not legal advice.
There are a variety of regulations governing the use of data which can identify natural persons (i.e. human beings), such as:
Personal information (PI), defined by the California Consumer Privacy Act (CCPA).
Personally identifiable information (PII), defined by various U.S. state laws and federal regulations.
Personal data, defined by the European Union (EU) General Data Protection Regulation (GDPR).
The GDPR’s definition is the most expansive of these, and the regulation itself is the most restrictive. It applies to any “processing” whatsoever, which covers a wide swath of activities. Thus, we use the blanket term “personal data” for everything falling under this definition.
Anecdotally, I know that many other organizations simply build their data privacy programs around GDPR because it is the most stringent standard and they don’t need to worry about complying with a variety of different rules.
To meet its requirements, we have a specific category of Public-Personal Data to cover anything subject to the regulation but which can nonetheless be made public.
Examples include:
My LinkedIn profile
Podcast recordings of me saying my name
Videos containing my face that I post on YouTube
Personal Data not provided under any obligation or expectation of confidentiality falls into this category. Because rules like the GDPR impose specific handling requirements, though, it needs a different category than just “Public.”
Confidential-Personal Data
Any Personal Data that has been provided under an obligation or expectation of confidentiality falls into this separate category.
Different jurisdictions have specialized requirements for handling different types of data, so it might make sense to create nested categories under the general category of Personal Data.
For example, if you are a “covered entity” under the U.S. Health Insurance Portability and Accountability Act (HIPAA), you might consider creating a sub (or entirely separate) category for protected health information (PHI). PHI has its own handling requirements mandated by HIPAA.
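Nested categories like this can be represented as a simple parent/child structure. The following is a minimal sketch under stated assumptions: the `Category` class and its methods are hypothetical names, and placing PHI under Confidential-Personal Data is just one of the two options mentioned above (the other being an entirely separate category).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Category:
    """A classification label with an optional parent label."""
    name: str
    parent: Optional["Category"] = None

    def lineage(self) -> List[str]:
        """Names from this category up to its top-level ancestor."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def falls_under(self, other: "Category") -> bool:
        """True if `other` is this category or one of its ancestors."""
        return other.name in self.lineage()


CONFIDENTIAL_PERSONAL = Category("Confidential-Personal Data")
# Assumption: PHI is nested under Confidential-Personal Data.
PHI = Category("PHI", parent=CONFIDENTIAL_PERSONAL)
```

With this shape, any rule written for Confidential-Personal Data can also be applied to PHI by checking `PHI.falls_under(CONFIDENTIAL_PERSONAL)`, while HIPAA-specific rules can target PHI alone.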
Confidential-Internal
This is information belonging to us not currently meant to be public, but which can be disclosed unilaterally with the approval of the data owner, without further coordination outside of our company.
Below are some examples. If you were especially concerned about compartmenting data, you could create nested sub-categories using some or all of the below to restrict dissemination even further:
Confidential-External
This is information belonging to another organization not meant to be public, which our company is bound by confidentiality requirements to protect and cannot release without external coordination.
While there should be an internally designated data owner, that person cannot release the information in question without permission from the relevant external party.
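The practical difference between Confidential-Internal and Confidential-External is who has to sign off before release. A minimal sketch of that rule follows; the function name `may_release` and its parameters are hypothetical, and it assumes that Confidential-External data needs both the internal data owner's approval and the external party's permission.

```python
def may_release(label: str, owner_approved: bool,
                external_party_approved: bool = False) -> bool:
    """Decide whether Confidential-Internal/External data may be released."""
    if label == "Confidential-Internal":
        # The data owner can approve release unilaterally.
        return owner_approved
    if label == "Confidential-External":
        # Assumption: the internal owner and the external party must both agree.
        return owner_approved and external_party_approved
    raise ValueError(f"release rule not defined for label: {label!r}")
```

Other labels raise an error on purpose, since this sketch only encodes the two release rules described above.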
Restricted
This is simply a way of collectively describing the three aforementioned types of Confidential information. For policy and procedure purposes, it can be helpful to have a single unifying term to describe all categories of data that cannot be processed in a certain way (e.g. using uncertified systems per StackAware’s AI security policy).
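A single unifying term makes rules like the uncertified-systems example easy to express as one check. The sketch below is illustrative only: `may_process`, the system names, and the idea of a set of certified systems are assumptions, not a description of StackAware's actual policy tooling.

```python
from typing import AbstractSet

# The three Confidential labels that "Restricted" collectively describes.
RESTRICTED_LABELS = frozenset({
    "Confidential-Personal Data",
    "Confidential-Internal",
    "Confidential-External",
})


def may_process(label: str, system: str,
                certified_systems: AbstractSet[str]) -> bool:
    """Allow Restricted data only on systems in the certified set."""
    if label in RESTRICTED_LABELS:
        return system in certified_systems
    # Assumption: this particular rule places no limit on other labels,
    # although separate rules (e.g. for personal data) may still apply.
    return True
```

For example, with a hypothetical certified set of `{"approved-llm-gateway"}`, Confidential-External data would be allowed on `"approved-llm-gateway"` but blocked on any other system.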
Don’t make things more complex than they need to be
StackAware’s system is roughly the same level of complexity as most of the data classification systems that I have seen in my career, but far more actionable.
The key is to make things only as complicated as they need to be, but no more.
Otherwise, people just give up trying to follow the system at all.
The good news is, StackAware can tailor your data classification (and entire AI governance program) to your specific needs.
StackAware is not a publicly traded company, so it has more latitude to disclose than if it were. For public companies, where such information is heavily regulated by the U.S. Securities and Exchange Commission (SEC), it might make sense to create a sub-category specifically for material non-public information (MNPI). Again, this is not legal advice. Consult competent counsel.
Secrets are anything that is not sensitive by itself but that directly grants access to sensitive data. Thus, they require protection at the same level as the data they guard. Examples include:
Application program interface (API) keys
Physical security codes
Passwords
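Because a secret takes on the protection level of whatever it guards, its handling can be derived from the labels of the data it unlocks. Here is a minimal sketch, assuming a hypothetical inventory that maps dataset names to labels; `labels_for_secret` and the dataset names are illustrative, not part of any real system.

```python
from typing import Iterable, Mapping, Set

# Hypothetical inventory: dataset name -> classification label.
DATASET_LABELS = {
    "customer_contracts": "Confidential-External",
    "product_roadmap": "Confidential-Internal",
    "blog_posts": "Public",
}


def labels_for_secret(grants_access_to: Iterable[str],
                      dataset_labels: Mapping[str, str]) -> Set[str]:
    """Return the labels whose handling rules apply to a secret.

    Raises KeyError if a dataset is missing from the inventory, so an
    unlabeled dataset is never silently ignored.
    """
    return {dataset_labels[name] for name in grants_access_to}
```

An API key that unlocks both `customer_contracts` and `product_roadmap` would then need to satisfy the handling rules for both Confidential-External and Confidential-Internal.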