While doing an AI risk assessment for a client, I found this interesting issue with a major governance, risk, and compliance (GRC) tool:
It trains AI models on your confidential data
The following passage popped out while I was reading the company’s blog:
Recognizing the unique nature of each customer's environment, our machine learning models can be designed to be tenant-specific. This means that when we train our AI systems, we use only your data, ensuring that the insights and intelligence gleaned are exclusive to your organization.
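To make “tenant-specific” concrete, here’s a minimal sketch of what that setup could look like: one model per customer, fit only on that customer’s own data. To be clear, the library choice and every name below are my own illustration, not anything the company has disclosed.

```python
# A purely illustrative sketch of "tenant-specific" training: one model per
# customer, fit only on that customer's own data. Library and names are my
# assumptions, not the company's implementation.
from sklearn.linear_model import LogisticRegression

tenant_models = {}  # tenant_id -> trained model


def train_tenant_model(tenant_id, X, y):
    """Fit a model on a single tenant's data and store it under that tenant only."""
    model = LogisticRegression()
    model.fit(X, y)  # no other tenant's data ever enters this fit
    tenant_models[tenant_id] = model
    return model
```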
Considering the sensitivity of the information this company processes, this could be a major risk. But before my mental alarm bells rang, I saw:
The company says the AI models are available only to you
According to the company’s terms of service:
When Customer Data is used to improve COMPANY_NAME’s machine learning models, COMPANY_NAME will ensure that such Customer Data, including Personal Data, is not reproduced by the model to another customer¹
Good to know.
If the company is already holding your sensitive data, there’s no marginal increase in risk from it also holding AI models trained on that data.
I was about to wrap up my assessment when I saw:
The company says it owns these AI models
Also according to their terms of service:
all rights, title and interest in and to the Services and all hardware, Software and other components of or used to provide the Services and COMPANY_NAME’s machine learning algorithms, including all related Intellectual Property Rights, will remain with COMPANY_NAME and belong exclusively to COMPANY_NAME.
What constitutes this company’s (as opposed to its customer’s) machine learning algorithms wasn’t perfectly clear. The terms of service don’t define “machine learning algorithms” or “machine learning models,” but the only reference to either term comes immediately after COMPANY_NAME, implying the company claims ownership of all of them.
The company doesn't appear to delete these AI models when the customer relationship ends
A final section in their terms of service states:
Upon Customer’s written request and in accordance with COMPANY_NAME’s Customer Data Deletion and Retention Policy found in COMPANY_NAME’s Trust Center, COMPANY_NAME will make Customer Data available to Customer for export or download as provided in the Documentation…Thereafter, COMPANY_NAME will have no obligation to maintain or provide any Customer Data and COMPANY_NAME will delete Customer Data in accordance with COMPANY_NAME’s Data Deletion and Retention Policy available in COMPANY_NAME’s Trust Center unless prohibited by law or legal order.
Since it appears machine learning algorithms/models trained on Customer Data are not themselves Customer Data, I understand this section to mean the company retains these models after the end of the relationship.
Why would the company hang onto models trained on customer data?
Since the company contractually binds itself not to make these models available to other customers, I don't know why.
Retaining them creates risks for both parties because there is always a chance attackers could steal the models.
The only alternative explanation is that the company might sell (or license) the models to someone other than a customer. Doing so would be a major reputational problem for this (security) company. But it seems unlikely, because the only organizations interested would be the company’s competitors (or cybercriminals :]).
Security, intellectual property ownership, and other norms aren't well established for AI
Assuming this setup is just an oversight, if I were a customer I would ask the company to clarify its terms of service. Specifically, I would ask that any tenant-specific machine learning models trained on Customer Data be destroyed under the same conditions as the Customer Data itself.
But you need to be aware of these issues to do anything about them. And that’s what StackAware is here for.
We help AI-powered companies manage risks related to:
Cybersecurity
Compliance
Privacy
while delivering value to customers
Need help navigating these types of issues?
¹ Exactly how it ensures your data is not reproduced to another customer is important, and has two distinct aspects:
1. How does the company prevent models trained on one customer’s confidential information from being available to another customer? This is just as important as how the company segregates data between customers in a multi-tenant environment (see the first sketch below).
2. For models that are trained on anonymized data and made available to all customers (which the company also does), how is the data anonymized? This is an especially important question because the data is likely unstructured (e.g. free text, images, uploaded documents) and therefore harder to anonymize than structured information (e.g. database columns); the second sketch below illustrates why.
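On the first question, model-level isolation might look something like the following sketch, where a model can only ever be served back to the tenant it was trained for. The registry class and its enforcement point are my assumptions, purely for illustration.

```python
# Hypothetical model registry enforcing tenant isolation at lookup time:
# a model is only ever served to the tenant it was trained for.
class ModelRegistry:
    def __init__(self):
        self._models = {}  # tenant_id -> model

    def store(self, tenant_id, model):
        self._models[tenant_id] = model

    def get(self, requesting_tenant_id):
        # The requesting tenant's own identity is the only key that works,
        # so a cross-tenant request has nothing to fetch.
        if requesting_tenant_id not in self._models:
            raise PermissionError("No model available for this tenant")
        return self._models[requesting_tenant_id]
```

On the second question, here’s a toy example of pattern-based redaction of free text, and of why it falls short: it catches direct identifiers like emails and SSNs but misses contextual ones, which structured-column masking never has to deal with. The patterns and helper are hypothetical.

```python
# Toy pattern-based redaction of free text. It catches direct identifiers
# (emails, SSNs) but misses contextual ones ("the CFO who joined in March"),
# which is exactly why unstructured data is harder to anonymize.
import re

PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def redact(text):
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("Email jane@example.com about SSN 123-45-6789"))
# -> "Email [EMAIL] about SSN [SSN]"
```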
With that said, these are separate issues from the main point of my post.