TL;DR
A new controversy regarding AI training on customer data has erupted, this time with the instant messaging company Slack. It appears to have begun when Gergely Orosz, author of The Pragmatic Engineer (a Substack I recommend), highlighted some issues with Slack’s “Privacy Principles” on Threads.1 Upon analysis, it doesn’t appear that what Slack was doing, or even claiming the right to do, is especially concerning from a security or privacy perspective.
The incident underscores how important clear communication is when rolling out AI-powered features. It also highlights several common misconceptions.
Looking at Slack’s Privacy Principles for search, learning, and AI
The pre-controversy version wasn’t especially concerning
As of April 26, 2024 (before the latest press attention), the page did not distinguish between generative and other forms of AI, but it did note that:
For any model that will be used broadly across all of our customers, we do not build or train these models in such a way that they could learn, memorise, or be able to reproduce some part of Customer Data.
It also laid out some example use cases (not a comprehensive list) in which it was training AI:
Channel recommendations
Emoji suggestion
Search results
Autocomplete
The first three suggest outputs that are categorical in nature:
this channel or that
smiley face vs. thumbs up
search result 1 vs. search result 2
None of these would be capable of exposing customer data to those not authorized to see it. The final one is a little gray, but Slack explains:
We protect data privacy by using rules to score the similarity between the typed text and suggestion in various ways, including only using the numerical scores and counts of past interactions in the algorithm.
This would seem to take care of any potential data leakage concerns.
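Slack hasn’t published how this scoring works, but the general pattern it describes is easy to illustrate: rank candidate completions using only numerical similarity scores and counts of past interactions, so no raw customer text feeds a model shared across customers. The sketch below is my own illustrative assumption (function names, weights, and all), not Slack’s implementation:

```python
# Hypothetical illustration of the pattern Slack describes: the only signals that
# would ever feed a model shared across customers are numerical similarity scores
# and interaction counts -- never the typed text or the suggestion text itself.
# All names and weights are illustrative assumptions, not Slack's implementation.
from difflib import SequenceMatcher


def numeric_features(typed_text: str, candidate: str, accept_count: int) -> tuple[float, float]:
    """Reduce a (typed text, candidate) pair to numbers only."""
    similarity = SequenceMatcher(None, typed_text, candidate).ratio()  # 0.0 - 1.0
    popularity = min(accept_count, 100) / 100  # dampened count of past acceptances
    return similarity, popularity


def rank_suggestions(typed_text: str, candidates: dict[str, int]) -> list[str]:
    """Rank candidates for the typing user; the raw text never leaves this step."""
    def score(item: tuple[str, int]) -> float:
        similarity, popularity = numeric_features(typed_text, *item)
        return 0.7 * similarity + 0.3 * popularity

    return sorted(candidates, key=lambda c: score((c, candidates[c])), reverse=True)


# The typing user sees ranked suggestions; any cross-customer learning would see
# only the (similarity, popularity) numbers produced by numeric_features().
print(rank_suggestions("proj", {"project-x-planning": 12, "product-updates": 3}))
```

The property that matters for leakage is that only the numbers, not the underlying text, would ever be used to improve anything shared across customers.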
After the controversy began, Slack updated its AI privacy principles, which weren’t especially concerning to begin with, but also added some confusion
As of May 19, 2024, the company added a section on Generative AI, which specifically notes:
Slack does not train LLMs or other generative models on Customer Data.
Confusingly, though, the sentence beforehand states “[n]o Customer Data is used to train third-party LLM models [sic].” This implies the restriction on training with Customer Data applies only to third-party large language models (LLMs), while the sentence that follows is far more expansive.
Recommendation to Slack: just cut this sentence.
Otherwise, I’m not that concerned about what Slack is doing here. You can also opt out if you would like by sending an email to the company.
Slack AI’s marketing page isn’t perfectly accurate
This is where things get a little more challenging. Obviously it’s harder to distill things down accurately in marketing-speak, but I think the below statement is too broad:
Slack is definitely using “your data” to train Slack AI. It’s just not doing it in a way that presents security or privacy concerns to me.
Recommendation to Slack: replace the first two sentences with something like “we protect your data.”
Slack AI is vague about its data retention period
According to the separate company blog post How We Built Slack AI To Be Secure and Private:
Where possible, Slack AI’s outputs are ephemeral
Based on the above, a key question here is whether Slack AI’s data retention periods differ from those for other features like normal messages. If responses are stored only as long as the underlying data on which they are grounded, there is no real issue to worry about.
If, however, prompts are retained on a separate schedule from normal messages, that could create some concerns.
For example, assume an organization has a 180-day retention period for messages. For 6 months, employees of a Slack customer are talking about super-secret project X, after which the company abandons it. One year later (the 180-day retention period plus 6 months of working on the project), there should be no record of project X in the company’s Slack environment.
That is, unless Slack AI prompts or responses have a retention period longer than 180 days.
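To make the arithmetic concrete, here is a minimal sketch with hypothetical dates and a hypothetical, longer retention schedule for AI outputs, showing how a summary could outlive the messages it was built from:

```python
# Minimal sketch of the retention arithmetic above. The schedules and dates are
# hypothetical; the point is that an AI-generated summary outliving its source
# messages can preserve data the customer believes was purged.
from datetime import date, timedelta

MESSAGE_RETENTION = timedelta(days=180)
AI_OUTPUT_RETENTION = timedelta(days=365)  # assumption: a longer, separate schedule

last_project_x_message = date(2024, 6, 30)   # project abandoned after 6 months
summary_generated = date(2024, 6, 30)        # Slack AI summary of those messages

message_purge_date = last_project_x_message + MESSAGE_RETENTION
summary_purge_date = summary_generated + AI_OUTPUT_RETENTION

check_date = date(2025, 1, 15)  # roughly a year after the project started
print(f"Messages purged?  {check_date >= message_purge_date}")  # True
print(f"Summary purged?   {check_date >= summary_purge_date}")  # False -> residual record
```

If Slack AI outputs follow the same schedule as the messages themselves, the second check flips to True and the concern disappears.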
The blog post also states:
[W]e built in special support to make sure that derived content, like summaries, are aware of the messages that went into them; for example, if a message is tombstoned because of Data Loss Protection (DLP), any summaries derived from that message are invalidated.
Similarly, if a summary is “invalidated,” it would be important to know whether it is deleted as well.
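Slack hasn’t published how invalidation works, but one way to picture it is a derived-content store that tracks which messages each summary came from, so a DLP tombstone can cascade. The sketch below is purely illustrative; whether “invalidated” also means “deleted” is exactly the open question, and the sketch simply deletes for the sake of the example.

```python
# Hypothetical sketch of derived-content tracking: each summary records the
# message IDs it was built from, so a DLP tombstone on any source message can
# cascade to the summary. Whether "invalidated" also means "deleted" is the open
# question raised above; this sketch deletes for illustration.
from dataclasses import dataclass, field


@dataclass
class Summary:
    text: str
    source_message_ids: set[str] = field(default_factory=set)


class DerivedContentStore:
    def __init__(self) -> None:
        self._summaries: dict[str, Summary] = {}

    def add_summary(self, summary_id: str, summary: Summary) -> None:
        self._summaries[summary_id] = summary

    def tombstone_message(self, message_id: str) -> list[str]:
        """Invalidate (here: delete) every summary derived from `message_id`."""
        affected = [
            sid for sid, s in self._summaries.items()
            if message_id in s.source_message_ids
        ]
        for sid in affected:
            del self._summaries[sid]  # retention question: purge, or merely flag?
        return affected


store = DerivedContentStore()
store.add_summary("sum-1", Summary("Q3 recap", {"msg-1", "msg-2"}))
print(store.tombstone_message("msg-2"))  # ['sum-1']
```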
Slack - like many organizations - is a little unclear about how customer data is handled
The first section heading of the blog post is “Customer data never leaves Slack.”
This is a pretty sweeping statement that is not literally or even figuratively true.
Slack later clarifies that it never allows “customer data to leave Slack-controlled VPCs [virtual private clouds].”
This is a slightly different statement than that heading. And as we have seen from events like an early 2021 outage, the fact that Slack uses AWS for hosting means AWS can directly impact Slack’s data without Slack’s consent.
For some reason people keep calling Infrastructure-as-a-Service (IaaS) environments “on-premises” or treating them as if they are.
They are not.
On a positive note, Slack AI appears to be deliberately applying a neutral security policy
According to the blog post:
Slack AI’s search feature, for example, will never surface any results to the user that standard search would not.
This is good practice and suggests Slack is following a neutral security policy by using a rules-based approach for providing data to the underlying AI model for retrieval-augmented generation (RAG).
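Slack hasn’t shared its retrieval code, but a neutral security policy in a RAG pipeline boils down to one rule: filter candidate documents against the requesting user’s existing permissions before anything reaches the model. A minimal sketch (with made-up names and a placeholder ranking) might look like this:

```python
# Minimal sketch of a "neutral security policy" for RAG: the retriever filters
# candidates against the requesting user's existing permissions *before* anything
# reaches the language model. Function and field names are illustrative, not Slack's.
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    allowed_users: frozenset[str]


def retrieve_for_rag(query: str, user_id: str, index: list[Document]) -> list[Document]:
    """Return only documents the user could already see via standard search."""
    visible = [d for d in index if user_id in d.allowed_users]
    # Placeholder relevance ranking; a real system would use embeddings or BM25.
    return sorted(visible, key=lambda d: query.lower() in d.text.lower(), reverse=True)


index = [
    Document("d1", "Project X launch plan", frozenset({"alice"})),
    Document("d2", "Public holiday schedule", frozenset({"alice", "bob"})),
]
# Bob never sees d1, so neither does the model answering his question.
print([d.doc_id for d in retrieve_for_rag("project x", "bob", index)])  # ['d2']
```

Because the permission check happens at retrieval time, the model can never summarize or cite something the user couldn’t already find via standard search.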
Clarifying the original concerns
In Threads, Gergely highlighted what I consider to be some misconceptions.
Enterprise products are training on your data
Gergely wrote:
I always thought that you are the product when you don’t pay for a service. Slack is showing that even when you pay for it, they treat you as the product.
It’s unacceptable that this is automatic opt-in, and paying organizations are not opted out by default.
I can see why he might think this, but I believe this sentiment is unrealistic. For example, the Google Workspace Terms of Service note:
3.6 Abuse Monitoring. Customer is solely responsible for monitoring, responding to, and otherwise processing emails sent to the “abuse” and “postmaster” aliases for Customer Domain Names, but Google may monitor emails sent to these aliases to allow Google to identify Services abuse.
At a large organization, this could be thousands of emails a day, many of which include confidential data. I also think it’s fair to assume Google will be doing its “monitoring” and “identification” in an AI-powered fashion. [Update 14 July 2023] In any case, doing so is not prohibited by these terms. And for Google Workspace:
Interactions with intelligent Workspace features, such as accepting or rejecting spelling suggestions, or reporting spam, are anonymized and/or aggregated and may be used to improve or develop helpful Workspace features like spam protection, spell check, and autocomplete.
Unless an organization has the leverage to negotiate a separate agreement, there is no opt-out provision.
Finally, while this is an example of training on enterprise customer content, basically every enterprise product trains (or claims the right to train) on the underlying metadata as well.
Examining StackAware’s supply chain for AI training processes revealed as much.
It’s best to have (AI) security terms documented contractually, but companies don’t get a free pass for anything not in their terms of service or privacy policy
After a Slack engineer replied by citing the post How We Built Slack AI To Be Secure and Private, Gergely wrote:
An ML engineer at Slack says they don’t use messages to train LLM models [sic].
My response is that the current terms allow them to do so. I’ll believe this is the policy when it’s in the policy. A blog post is not the privacy policy: every serious company knows this.
There are two issues here:
What Gergely linked to and described as the “policy” is not actually Slack’s privacy policy (which is here)! What he linked to is a set of “Privacy Principles.”
Whether or not text is in a document titled “Privacy Policy” is not a perfect predictor of whether it is binding on a company.
There seems to be a sentiment in the technology space that companies can say one thing in their marketing blog, social media, etc. and then another in their “official” legal documents and that the former are irrelevant from a privacy or security perspective.
At least in the United States, several regulatory agencies have made clear this is definitely not true!
In 2023, the Federal Trade Commission (FTC) wrote a threatening (and otherwise vague) blog post called “Keep your AI claims in check.” One thing that was clear, though: exaggerating AI product capabilities can constitute a “deceptive” practice. The post said nothing about these claims needing to be in the terms and conditions or privacy policy.2
In 2024, the Securities and Exchange Commission (SEC) settled charges with two investment advisors who it alleged made false and misleading claims about their AI use. The SEC specifically cited press releases, website articles, and social media posts as the offending content.
I certainly agree that getting contractual assurances about security and privacy measures is the best way to go. And you may not have a private right to litigate for anything not contained therein. But with that said, companies definitely do not get a free pass on content outside of their privacy policy and terms and conditions.
I am not an attorney and this is not legal advice.
Slack’s PR disaster was mainly that: a communications fumble
Even before their clarification, Slack’s “Privacy Principles” would pass muster if I were evaluating them for a StackAware client. I think they still need some additional wordsmithing, but I don’t have major concerns with their current state either.
In terms of magnitude, this issue is much closer to Zoom’s stumble last summer. While what the company actually ended up doing with AI wasn’t that concerning to me, its communication was ham-handed.
This contrasts starkly with DocuSign’s generative AI rollout earlier this year. As I documented in a series of LinkedIn posts, what they said they were doing either didn’t make sense or was very concerning from a security perspective.
So, what should you do if you are a company rolling out AI features and don’t want to make the same mistakes?
I put together a set of transparency principles in this blog post.
And if you need help planning a secure and compliant AI product launch, let us help.
The issue seems to have surfaced on Hacker News at roughly the same time. I’m not sure who actually kicked off the firestorm.
The FTC separately has said that silently changing your terms and conditions to allow for AI training on customer data could be a deceptive practice, but this is a different issue and doesn’t contradict my point here.