[wp-trac] [WordPress Trac] #62257: Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling content without specific consent

WordPress Trac noreply at wordpress.org
Fri Oct 18 17:55:08 UTC 2024


#62257: Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling
content without specific consent
-------------------------+-----------------------------
 Reporter:  rickcurran   |      Owner:  (none)
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  Awaiting Review
Component:  Privacy      |    Version:  trunk
 Severity:  normal       |   Keywords:
  Focuses:               |
-------------------------+-----------------------------
 This change / enhancement is intended to add known AI Crawler bots as
 disallow entries to WordPress' virtual robots.txt file to prevent AI bots
 from crawling site content without specific user consent.

 This is done by changing the `do_robots` function in `wp-
 includes/functions.php`. The updated code loads a list of known AI bots
 from a JSON file, `ai-bots-for-robots-txt.json`, creates a `User-agent:`
 entry for each one, and disallows their access (see the sketch below).
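
 Roughly, the change would look something like this. This is only a minimal
 sketch: `do_robots` and the JSON file name come from the patch description
 above, but the helper function name and the assumption that the JSON file
 is a flat array of user-agent strings are purely illustrative.

 {{{#!php
 <?php
 // Illustrative helper: load the list of known AI crawler user agents.
 // (Helper name and JSON structure are assumptions, not part of core.)
 function wp_get_ai_bots_for_robots_txt() {
     $json = file_get_contents( ABSPATH . WPINC . '/ai-bots-for-robots-txt.json' );
     $bots = json_decode( $json, true );

     return is_array( $bots ) ? $bots : array();
 }

 // Inside do_robots(), appended to $output before the existing
 // `robots_txt` filter is applied, something along these lines:
 foreach ( wp_get_ai_bots_for_robots_txt() as $user_agent ) {
     $output .= "\nUser-agent: {$user_agent}\n";
     $output .= "Disallow: /\n";
 }
 }}}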

 **Why is this needed?**
 My perspective is that blocking AI bots by default in WordPress is a
 strong stance against companies like OpenAI, Perplexity, Google and Apple
 mass-scraping people’s content for AI training without their consent.

 Microsoft’s AI CEO Mustafa Suleyman was quoted recently saying:

 ''“With respect to content that is already on the open web, the social
 contract of that content since the 90s has been that it is fair use.
 Anyone can copy it, recreate with it, reproduce with it. That has been
 freeware, if you like. That’s been the understanding,”''

 This statement seems to say the quiet part out loud: many AI companies
 clearly believe that because content has been shared publicly on the web,
 it is available to be used for AI training ''by default''. Unless the
 publisher specifically says it must not be used, they see no problem with
 crawling it and absorbing it into their AI models.

 I am aware that plugins already exist for people who wish to block these
 bots, but they only help those who are aware of the issue and choose to
 act on it. I believe consent should be ''requested by these companies''
 and explicitly given, rather than the default being that companies can
 presume it’s OK to scrape any website that doesn’t specifically say
 “no”.

 Having 43%+ of websites on the internet suddenly say “no” by default seems
 like a strong message to send. I realise that robots.txt blocking won’t
 stop anonymous bots that ignore it, but at least the legitimate companies
 that intend to honour it will take notice.

 With the news that OpenAI is switching from being a non-profit
 organisation to a for-profit company, I think a stronger stance is needed
 on the default permissions for content published using WordPress. So
 whilst the default would be to block the AI bots, people / publishers
 would still have a way to allow access to their content using the same
 methods already available for modifying ‘robots.txt’ in WordPress
 (plugins, custom code, etc.), as in the example below.
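
 For example, a publisher could opt back in via the existing `robots_txt`
 filter that `do_robots` already applies to its output. The "GPTBot" name
 and the exact string being stripped are illustrative assumptions; the real
 entries would depend on the final patch.

 {{{#!php
 <?php
 // Re-allow one crawler by removing its default disallow block
 // from the generated robots.txt output.
 add_filter(
     'robots_txt',
     function ( $output, $public ) {
         $output = str_replace( "User-agent: GPTBot\nDisallow: /\n", '', $output );

         return $output;
     },
     20,
     2
 );
 }}}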

 (Apologies if I am missing information here; this is my first time pushing
 code via Trac / GitHub, so I am still finding my feet with the process!)

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/62257>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform


More information about the wp-trac mailing list