[wp-trac] [WordPress Trac] #62257: Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling content without specific consent
WordPress Trac
noreply at wordpress.org
Fri Oct 18 17:55:08 UTC 2024
#62257: Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling
content without specific consent
-------------------------+-----------------------------
Reporter: rickcurran | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: Awaiting Review
Component: Privacy | Version: trunk
Severity: normal | Keywords:
Focuses: |
-------------------------+-----------------------------
This change / enhancement is intended to add known AI Crawler bots as
disallow entries to WordPress' virtual robots.txt file to prevent AI bots
from crawling site content without specific user consent.
This is done by changing the `do_robots` function in `wp-
includes/functions.php`: the updated code loads a list of known AI bots
from a JSON file, `ai-bots-for-robots-txt.json`, and creates a `User-agent:`
entry for each one that disallows its access.
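For illustration, the relevant part of the change looks roughly like the
sketch below (this is not the exact diff; it assumes `ai-bots-for-robots-
txt.json` ships in `wp-includes/` as a flat JSON array of user-agent
strings, and `wp_get_ai_bot_user_agents()` is a hypothetical helper name):
{{{#!php
<?php
// Sketch only: load the bot list shipped alongside core. Assumes the JSON file
// is a flat array of user-agent strings, e.g. ["GPTBot", "CCBot", "PerplexityBot"].
function wp_get_ai_bot_user_agents() { // hypothetical helper name
	$json = file_get_contents( ABSPATH . WPINC . '/ai-bots-for-robots-txt.json' );
	$bots = json_decode( $json, true );

	return is_array( $bots ) ? $bots : array();
}

// Inside do_robots(), before $output is passed through the 'robots_txt' filter,
// the change would append one block per bot, roughly:
foreach ( wp_get_ai_bot_user_agents() as $user_agent ) {
	$output .= "User-agent: $user_agent\n";
	$output .= "Disallow: /\n\n";
}
}}}
Each listed bot would then appear in the virtual robots.txt as a
`User-agent:` line followed by `Disallow: /`.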
**Why is this needed?**
My perspective is that blocking AI bots by default in WordPress would be a
strong stance against the mass scraping of people’s content by companies
like OpenAI, Perplexity, Google and Apple for use in AI training without
their consent.
Microsoft’s AI CEO Mustafa Suleyman was recently quoted as saying:
''“With respect to content that is already on the open web, the social
contract of that content since the 90s has been that it is fair use.
Anyone can copy it, recreate with it, reproduce with it. That has been
freeware, if you like. That’s been the understanding,”''
This statement seems to say the quiet part out loud: many AI companies
clearly believe that because content has been shared publicly on the web it
is available for AI training ''by default'', so unless the publisher
specifically says it should not be used, they see no problem with that
content being crawled and absorbed into their AI models.
I am aware that plugins already exist for people who wish to block these
bots, but they only help those who are aware of the issue and choose to act
on it. I believe consent should be ''requested by these companies'' and
explicitly given, rather than the default being that companies can presume
it’s OK to scrape any website that doesn’t specifically say “no”.
Having 43%+ of websites on the internet suddenly say “no” by default seems
like a strong message to send. I realise that robots.txt blocking isn’t
going to stop the anonymous bots that scrape content anyway, but at least
the legitimate companies that intend to honour it will take notice.
With the news that OpenAI is switching from being a non-profit organisation
to a for-profit company, I think a stronger stance is needed on the default
permissions for content published using WordPress.
So, whilst the default would be to block the AI bots, people / publishers
could still allow access to their content using the same methods currently
available for modifying `robots.txt` in WordPress: plugins, custom code,
etc. (see the sketch below).
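As a hedged example of that opt-in path, a small plugin or mu-plugin could
use the existing `robots_txt` filter to strip an individual bot’s block
from the generated output (GPTBot is just an example agent name here, and
the exact string assumes entries are emitted in the `User-agent:` /
`Disallow:` form sketched above):
{{{#!php
<?php
// Sketch: opt a single bot back in by removing its block from the generated
// robots.txt output. Assumes entries take the form "User-agent: ...\nDisallow: /\n\n".
add_filter( 'robots_txt', function ( $output, $public ) {
	return str_replace( "User-agent: GPTBot\nDisallow: /\n\n", '', $output );
}, 20, 2 );
}}}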
(Apologies if I am missing information here; this is my first time pushing
code via Trac / GitHub, so I am still finding my feet with the process!)
--
Ticket URL: <https://core.trac.wordpress.org/ticket/62257>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
More information about the wp-trac mailing list