8 ways to sell AI to enterprises while securely training on their data
Win big deals without signing up to do the impossible.
There is no AI without training data.
The AI hyperscalers (OpenAI, Google, etc.) get training data from:
inputs from free-tier users into their tools (e.g. ChatGPT, Gemini)
publicly-available information
licensed data sets
to build and improve their models. They generally exempt enterprise customers' content from AI training, avoiding unintended training on sensitive business data.
This approach works for generalist large language models (LLMs) because:
Individuals get access to AI tools for free
The AI hyperscalers improve their products based on the data
Enterprise customers pay the hyperscalers to provide their products
But this system breaks down for specialized data that isn't available from any of these three sources.
Examples include:
Machine telemetry
Software performance data
Salary and retention information
Companies understandably view this as sensitive intellectual property. At the same time, they need to optimize their business using AI to analyze it. So they have two options:
Build the AI model(s) themselves
Have a vendor do it
Option 1 is expensive and distracts from the primary business of the company. For example, a cabinet maker might collect lots of machine telemetry, software performance data, and human resources information from its operations. But building bespoke AI models to process it would basically require it to stand up 3 different product divisions that have nothing to do with its core mission of building cabinets.
So many businesses turn to specialist AI vendors.
Outsourcing saves money because the vendor is only focused on analyzing one type of data. And it can improve performance because the vendor can train its model on a much larger set of similar data.
Unless, of course, the customer bans the vendor from training on its data.
Which I increasingly see in enterprise contracts.
These bans stifle a vendor’s ability to use customer data to enhance AI performance: the very reason driving the relationship to begin with.
How customers and AI vendors can get business value while managing risk
1. Contractually carve out categories of data on which vendors can train
While still maintaining a “default deny” approach to AI training, customers can allow vendors to train on specific types of data. For example, an agreement might specify that a vendor building a predictive maintenance model can train on only the following characteristics of the customer’s manufacturing equipment:
Date of last service
Date of manufacture
Uptime vs. downtime
Revolutions per minute
Defect rate in finished products
The customer could also share only this type of data with the vendor in the first place, preventing unintended training at a technical level.
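One way to enforce a carve-out like this technically is an allow-list filter over everything exported to the vendor, so only the agreed fields ever leave the customer's environment. A minimal sketch (the field names and the `export_for_vendor` helper are hypothetical):

```python
# Hypothetical allow-list of the contractually approved fields.
APPROVED_FIELDS = {
    "last_service_date",
    "manufacture_date",
    "uptime_hours",
    "downtime_hours",
    "rpm",
    "defect_rate",
}

def export_for_vendor(records: list[dict]) -> list[dict]:
    """Strip each record down to the fields the contract permits for training."""
    return [
        {key: value for key, value in record.items() if key in APPROVED_FIELDS}
        for record in records
    ]

# Identifying fields like serial numbers and operator names never leave the building.
raw = [{"serial_number": "A-123", "operator": "J. Smith", "rpm": 1450, "defect_rate": 0.02}]
print(export_for_vendor(raw))  # [{'rpm': 1450, 'defect_rate': 0.02}]
```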
2. Anonymize training data on a personal and organizational level
While anonymization is primarily a privacy control and can miss things like trade secrets, removing any connection between the data and the organization or people creating it can reduce the potential for damage. Especially with aggregated data from many different organizations, the likelihood of an AI model reproducing information damaging to a specific customer is lower than without anonymization.
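At its most basic, this means dropping direct identifiers and replacing the organization with a keyed pseudonym before any record enters a training set. A rough sketch with illustrative column names; real de-identification also has to handle quasi-identifiers, free-text fields, and trade secrets:

```python
import hashlib

# Illustrative only: a real de-identification program needs a much broader review.
DIRECT_IDENTIFIERS = {"employee_name", "email", "customer_name"}

def pseudonymize_org(org_id: str, salt: str) -> str:
    """Replace the organization with a salted, one-way pseudonym."""
    return hashlib.sha256((salt + org_id).encode()).hexdigest()[:12]

def anonymize(record: dict, salt: str) -> dict:
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "org_id" in cleaned:
        cleaned["org_id"] = pseudonymize_org(cleaned["org_id"], salt)
    return cleaned

record = {"org_id": "acme-cabinets", "employee_name": "J. Smith", "tenure_years": 4}
print(anonymize(record, salt="rotate-me-per-release"))
```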
There remains, of course, the possibility of:
External sensitive data generation. A sufficiently powerful algorithm can infer (and expose to others) sensitive information from seemingly innocuous bits of data, essentially reversing anonymization. It’s conceivable that an unethical customer might attempt to manipulate the AI vendor’s product to unearth information about competitors with this method.
Internal sensitive data generation. From a compliance perspective, it’s possible that a model trained on the customer’s data reproduces personal information it wasn’t even exposed to. This can create regulatory and business risk if a data subject ever seeks to stop this processing of personal data.
3. Forbid the vendor from selling the model (or itself) to a customer (or its competitor)
Unlike tools like ChatGPT, where OpenAI wants to give direct access to as many people as possible, specialist AI tools have a narrower customer base. It could be that direct competitors would want to use the same vendor for training AI models because it is a specialist with access to the most and best data.
This, of course, raises the possibility that one of the larger customers simply decides to in-source the analytics work by buying the vendor.
This would be a major business threat to the vendor's other customers who compete with the buyer. The buyer would have access to - at a minimum - an AI model optimized for the specific industry use case. It might also be able to access the (potentially even non-anonymized) training data of competitors.
The best solution here would be for the vendor to agree not to sell itself or its models to anyone competing with its customers. While entrepreneurs generally don’t want to rule out a certain type of acquirer, doing so will be necessary in this case. And the vendor can still sell itself to one of its competitors or a financial buyer (e.g. private equity firm) without breaking the agreement or worrying its customers.
An additional control would be to:
4. Require the vendor to destroy the training data after a set period of time
There remains the risk of the vendor deciding to just breach its contractual requirements anyway and sell itself to one of its customers (or competitors of a customer). This might be because:
The vendor doesn’t think anyone will notice or remember.
Its buyer has an enormous legal team to fight challenges.
Some other reason.
To mitigate this risk, customers of the vendor can require the vendor to delete customer data used in training. Obviously there will be a tension between the vendor’s technical need to use the data for training subsequent models and the customer’s confidentiality concerns. With that said, a time limit on retention will address both the specific business concerns I raised as well as general cybersecurity ones.
The vendor can build customer trust in its process by:
Explaining how the deletion pipeline works (a minimal sketch follows this list)
Allowing a 3rd party to audit the deletion
Providing certificates of destruction
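As a reference point for what that deletion pipeline might look like, here is a minimal sketch of a scheduled retention job that purges customer training batches older than the contractual limit and logs evidence a third party could audit. The retention window, directory layout, and logging format are all assumptions for illustration:

```python
import json
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path

RETENTION_DAYS = 180  # assumed contractual retention limit
# Hypothetical layout: /data/customer-training/<customer>/batch-<ISO date>/
TRAINING_DATA_ROOT = Path("/data/customer-training")

def purge_expired_batches(now: datetime | None = None) -> list[str]:
    """Delete training batches older than the retention window and return what was removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS)
    deleted = []
    for batch_dir in TRAINING_DATA_ROOT.glob("*/batch-*"):
        batch_date = datetime.fromisoformat(
            batch_dir.name.removeprefix("batch-")
        ).replace(tzinfo=timezone.utc)
        if batch_date < cutoff:
            shutil.rmtree(batch_dir)
            deleted.append(str(batch_dir))
    # A record like this can back a certificate of destruction or a third-party audit.
    print(json.dumps({"purged_at": now.isoformat(), "deleted": deleted}))
    return deleted
```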
5. Use synthetic data for training
A technically savvy and security-focused customer could offer the vendor synthetic data for training instead of real-world information. Or the vendor could commit to generating synthetic data from the customer's information and using only the synthetic version for training (there is a brief sketch of this after the list below). Synthetic data is machine-generated to mimic live information and can help alleviate some security and privacy concerns. But it is not a perfect solution because:
It requires training some sort of algorithm in the first place to generate the synthetic data. This could bring us right back to the original problem.
Synthetic data is always going to be lower quality than real data. This will reduce the effectiveness of any AI models trained on it.
Taking this problem to an extreme, training on synthetic data could lead to model collapse. In this scenario, iterative rounds of training on AI-generated data reduce the model's effectiveness to the point where it becomes useless.
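For a sense of the simplest version of this approach, here is a toy generator that fits per-column statistics on real telemetry and samples new rows from them. The column names are illustrative, and production synthetic-data tools are far more sophisticated:

```python
import random
import statistics

def fit_and_sample(real_rows: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Fit a normal distribution to each numeric column, then sample synthetic rows."""
    random.seed(seed)
    columns = real_rows[0].keys()
    stats = {
        col: (statistics.mean(r[col] for r in real_rows),
              statistics.stdev(r[col] for r in real_rows))
        for col in columns
    }
    return [
        {col: random.gauss(mu, sigma) for col, (mu, sigma) in stats.items()}
        for _ in range(n)
    ]

real = [
    {"rpm": 1450, "defect_rate": 0.020},
    {"rpm": 1510, "defect_rate": 0.025},
    {"rpm": 1390, "defect_rate": 0.018},
]
synthetic = fit_and_sample(real, n=5)
print(synthetic[0])  # plausible-looking but machine-generated values
```

Even this tiny example throws away correlations between columns, which is one reason synthetic data tends to underperform the real thing.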
6. Clearly assign ownership of the resulting model
If the customer claims that models trained on its data are its property, this would represent an existential threat to the vendor's business model. Thus, customers might consider relinquishing ownership of the trained model (assuming all other contractual conditions are met).
In any case, being clear about model ownership will reduce the risk of future disputes.
7. Offer enterprise customers discounts for opting-in to training
Money talks.
And since all security and compliance risks are business risks, offering discounts to compensate for data confidentiality concerns with AI training might work. Vendors will need to be careful in pricing these discounts correctly, though. Because if they aren’t steep enough, no enterprise customer will take advantage of them. Iterative testing over time will highlight the right level.
This approach gives customers the option to:
Allow training on their data if they don’t consider it crucial to their business needs.
Otherwise, disallow it by paying more.
8. Customers accept the risk of the vendor training on data that is confidential but not critical to their business
Ultimately there is going to be some risk involved with giving data to a 3rd party. The arrangements I propose mitigate, transfer, and avoid some of it. To capture the business value of outsourcing AI model development to a 3rd party, though, business leaders will need to accept some residual risk.
Especially for data that isn’t core to business operations, this might be a wise choice.
Need help negotiating enterprise contracts related to AI training?
StackAware helps AI-powered companies navigate these unique business and security challenges. By building governance programs compliant with the National Institute of Standards and Technology (NIST) Risk Management Framework (RMF) and ISO/IEC 42001:2023, without crippling your business, we let you manage risk while winning with AI.
So if you are a:
Tech, security, or compliance leader
At a heavily-regulated AI-powered company
Who needs to close enterprise deals