Pharma AI security: top 5 risks - and how to mitigate them
Protecting data confidentiality, integrity, and availability in AI-powered drug manufacturing.
AI has enormous potential to revolutionize the pharmaceutical industry.
By assisting researchers with:
protein design
automating previously manual tasks
developing novel approaches to sticky problems
artificial intelligence will accelerate the development of cures for many diseases.
As with any new tool, there are potential security risks. Toward that end, I put together a list of the top 5 problems I see, as well as potential mitigations.
1. Unintended training leading to PHI leakage
Especially because Large Language Models (LLM) can take inputs in almost any format and their context windows are exploding, it’s tempting to just dump all your data into them.
If the model accidentally trains on inputs containing protected health information (PHI), like from a clinical trial, this may be a big problem. Especially dangerous in this respect are Software-as-a-Service (SaaS) tools which train on inputs by default, like ChatGPT.
And if it can regurgitate this raw data to those not under the appropriate confidentiality burden (e.g. signed business associate agreement [BAA]), you can be looking at a hefty fine here.
The biggest risk is probably in your supply chain, where less sophisticated vendors are experimenting with AI themselves.
Here are some potential mitigations:
Administrative and technical safeguards
Workforce training is the best way to minimize this risk internally. Having a clear policy which doesn’t just impose a blanket ban (driving AI use into the shadows) is a good place to start. Asking hard questions about vendor AI use - and controlling it contractually - will also be helpful here.
On the technical side, you can look at open source or commercial tools to block potentially sensitive inputs.
Data anonymization and minimization
Before training models, anonymize PHI to remove or obfuscate identifiable information, ensuring data can’t be traced back to individual people. Even if you remove obvious identifiers like names, other things like exact date of birth might not be necessary to preserve the model’s accuracy.
Architectural review
Engineers may be tempted to configure data pipelines in the quickest and most maintainable way possible, which is understandable from a purely business perspective. This can take the form of configuring an application program interface (API) to return every field in a database, for example. Unfortunately, this increases the likelihood of unintended training.
2. Data poisoning causing model malfunction
There is already enough controversy about training data sets for AI models used in healthcare. Imagine what would happen if a malicious actor is able to seed corrupted data into one?
This could lead to a model making (intentionally) inaccurate recommendations for drug design, potentially even leading to physical harm.
Especially if you are ingesting data from non-proprietary, public sources (and even if you aren’t), having safeguards in place here is key. These can include:
Thorough vetting of data sources used for training
Researchers have demonstrated the ability to reliably change the outputs of AI models by making imperceptible (to the human eye) changes to their training data. Considering the potential stakes involved, it’s quite conceivable that attackers might try to do the same to drug design or other pharmaceutical AI models.
Strict review procedures for data (especially if taken from the open internet) are a must here.
Automated review of open source AI libraries
This addresses a slightly different attack vector: the seeding of already-corrupted AI models into your technology stack. For example, Mithril Security (a StackAware partner) demonstrated how plant a pre-poisoned model in Hugging Face that provided incorrect responses to a narrow set of inquiries.
This is by no means an outlier. After the release of ChatGPT, there was an explosion in attempted software supply chain attacks using Python libraries. It is one of the most commonly-used languages for AI engineering, so cybercriminals rushed to plant malicious libraries in package managers like PyPI.
Deploying tools like software composition analysis (SCA) to detect malicious packages is your best bet here.
3. Tainted intellectual property (IP) of medicines developed with AI
The USPTO did the industry a solid by making clear the mere use of AI doesn't prevent a company from securing a patent for a drug. And generative AI-powered research will accelerate as a result.
With that said, if you aren't excruciatingly clear about who owns the outputs of AI training - and segregate the training data and resulting models appropriately - you could have a legal dispute on your hands.
One potential area of concern could arise from the use of commercially-available LLMs like OpenAI’s ChatGPT or potentially even “open source” ones like Meta’s Llama to generate training data. Both companies forbid - to varying degrees - the creation of subsequent LLMs with outputs from their products, which can give pharmaceutical manufacturers pause. This would be especially true if OpenAI or Meta were to ever attempt to move into this field.
As always, I’m not a lawyer and don’t give legal advice. But I did write an article with an attorney touching on this topic.
4. Theft of drug design applications and models
Even if the IP ownership of an AI model or the proteins it designs is uncontested, that doesn't necessarily stop someone from stealing it!
Throughout the development of the COVID-19 vaccine, a host of nation-state cyber actors allegedly tried to exfiltrate data about its design, including:
North Korea
Russia
China
Iran
Protecting what you have built is absolutely vital to monetizing it! Doing this comes down to the basics. But I’ll note it isn’t necessarily more secure to host AI models yourself (instead of relying on a 3rd party SaaS provider). A thorough build-vs.-buy security analysis is necessary.
5. Indirect prompt injection against internal LLMs
Pharma companies are unlikely to deploy customer-facing chatbots with the ability to impact ongoing R&D, but that doesn’t mean you can ignore prompt injection.
Even if the model's training data isn't tainted, it’s quite possible an attacker will leave malicious instructions on public web sites with the expectation they will be fed into LLMs.
Researchers are certain to add information from public sources into their prompts of internally-facing AI tools. This means that spoofing or even taking over commonly-visited sites to seed malicious (and invisible) HTML into them can be a valuable investment (for an attacker).
Especially considering the length to which countries like Russia and China have gone to infiltrate western software supply chains, creating these types of “waterholing” targets is well within their capabilities and resources. Mitigations include:
Allowlisting websites for use with LLMs
Even if a company isn’t your vendor, if you are scraping data from their website, they are part of your data supply chain. You rely on the integrity of their data (and their ability to preserve it) to a degree.
Thus, it might make sense to create a database of approved websites against which you deploy internally-developed AI applications to retrieve information. If you limit these targets, you lower the risk of:
Accidentally scraping a site that was created for malicious purposes.
Intentionally scraping a bona fide site that an attacker has taken over specifically to conduct indirect prompt injection.
Use AI models to evaluate data for malicious content prior to processing
Some have proposed deploying two or more LLMs, one with a gatekeeper role and the other in an execution role, to mitigate such prompt injection. This would hypothetically work by having the gatekeeper LLM look specifically for malicious content, block passaging of it to the execution LLM if detected.
Others have proposed technical modifications to AI models to defeat adversarial attacks.
While these may be able to mitigate some prompt injection risks, researchers have demonstrated a persistent ability to defeat most of these safety layers. So these methods are by no means silver bullets.
Conclusion
The benefits to humanity from AI-assisted pharmaceutical breakthroughs will be vast and I am deeply optimistic about the future. Protecting these critical developments from attackers, however, is not a trivial task.
If you are concerned about:
Protecting your intellectual property
Preserving patient privacy
Keeping R&D on track
while using AI, get it touch.
Thanks to Pradeep Bandaru for his comments prior to publication.
Related LinkedIn posts