Block AI tools from scraping your site in 3 minutes
How to tell bots and humans to buzz off (although they might not listen).
If you want to block major generative AI tools from scraping your web site:
Add the below to your robots.txt file:
Under advisement of competent legal counsel, consider changing your site’s terms and conditions to forbid the undesired activity. Understand the potential limitations of such moves.
Why you should (or shouldn’t) block AI bots
If you publish only basic content on your web site and want it to be more likely to be referred to when users query ChatGPT or any generative AI tool, then scraping isn’t necessarily a problem.
If, however, you have concerns about your copyrighted material being used in these tools, you might consider blocking them by modifying your robots.txt file by adding the above restrictions. I combined information from several different articles together to build this consolidated list on how to block ChatGPT, Google Bard, Anthropic, Twitter/X, Meta, the Common Crawl bot, and several others.
You can also block these tools from specific portions of your site.
But know that this approach is not foolproof or airtight and will only stop constrained actors who respect robots.txt files (they may not even be required to - see below). Additionally, new generative AI tools are popping up all of the time so you will likely be playing “whack-a-mole” to block each new one as it appears. Check darkvisitors.com for updates (h/t to Dino Dunn for flagging this for me).
And there is a balance to be had here.
It’s likely you will be penalized in search result or social media rankings as a result of this type of blocking. Substack includes its own built-in blocking functionality, and you’ll note it is not enabled for Deploy Securely:
Add teeth with terms and conditions?
If you choose to use certain NYT products or services displaying or otherwise governed by these Terms of Service, including NYTimes.com (the “Site”), NYT’s mobile sites and applications, any of the features of the Site, including but not limited to RSS feeds, APIs, and Software (as defined below) and other downloads, and NYT’s Home Delivery service (collectively, the "Services"), you will be agreeing to abide by all of the terms and conditions of these Terms of Service between you and NYT.
The next block defines “Content” and “non-commercial use”:
The contents of the Services, including the Site, are intended for your personal, non-commercial use. All materials published or available on the Services (including, but not limited to text, photographs, images, illustrations, designs, audio clips, video clips, “look and feel,” metadata, data, or compilations, all also known as the “Content”) are protected by copyright, and owned or controlled by The New York Times Company or the party credited as the provider of the Content.
Non-commercial use does not include the use of Content without prior written consent from The New York Times Company in connection with: (1) the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.
These types of “browsewrap” agreements, where a site asserts you must abide by certain conditions in order to continue using it but doesn’t get any sort of affirmative consent from the user, do not appear very effective from a legal perspective.
According to the legal software provider Ironclad, in a sample of cases from 2020 only 14% of browsewrap agreements were found enforceable when tested in court.
Additionally, in 2020, the 9th Circuit Court of Appeals ruled that scraping public web sites does not violate the Computer Fraud and Abuse Act (CFAA), according to the CHIP Law Group.
In its late 2023 lawsuit against OpenAI and Microsoft for using the newspaper’s content to train AI models without permission, the New York Times alluded to its browsewrap terms (above) but did not specifically note either company’s alleged violation of them as a numbered count in the complaint. Rather, its case focused on copyright infringement, Digital Millennium Copyright Act (DMCA) violations, trademark dilution, and unfair competition.
Balancing risk and reward with AI
As with pretty much all technology and security decisions, how to address AI tools scraping your site is ultimately a business-driven one. Unfortunately it’s going to be just one of dozens, if not hundreds, you’ll need to address as the pace of artificial intelligence development accelerates.
Are you looking for a trusted parter and advisor to help you manage the:
issues stemming from expanding use of AI?