[wp-trac] [WordPress Trac] #60805: Reading Settings: add option to discourage AI services from crawling the site

Tue Mar 19 12:06:28 UTC 2024

#60805: Reading Settings: add option to discourage AI services from crawling the
site
-----------------------------+-----------------------------
 Reporter:  jeherve          |      Owner:  (none)
     Type:  feature request  |     Status:  new
 Priority:  normal           |  Milestone:  Awaiting Review
Component:  Privacy          |    Version:
 Severity:  normal           |   Keywords:
  Focuses:  privacy          |
-----------------------------+-----------------------------
 I'd like to suggest a new addition to the bottom of the Reading Settings
 screen in the dashboard:

 [[Image(https://cldup.com/p6xw24IFff.png)]]

 This new section would help site owners indicate whether or not they would
 like their content to be indexed by AI services and used to train future
 AI models.

 There have been a lot of discussions about this in the past 2 years:
 content creators and site owners have asked whether their work could and
 should be used to train AI. Opinions vary, but at the end of the day I
 believe most would agree that as a site owner, it would be nice if I could
 choose for myself, for my own site.

 In practice, I would imagine the feature to work just like the Search
 Engines feature just above: when toggled, it would edit the site's
 `robots.txt` file and disallow a specific list of AI services from
 crawling the site.
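
 As a rough illustration of that behaviour, here is a minimal sketch in
 Python (WordPress itself is PHP; this only shows the logic). The function
 name `append_ai_rules`, the `discourage_ai` flag, and the two sample user
 agents are assumptions for illustration, not an existing Core API.

```python
# Hypothetical sketch: append "discourage AI" rules to robots.txt output.
# `append_ai_rules` and `discourage_ai` are illustrative names, not Core APIs.

SAMPLE_AI_AGENTS = ["GPTBot", "CCBot"]  # sample entries only, not exhaustive


def append_ai_rules(robots_txt: str, discourage_ai: bool,
                    agents=SAMPLE_AI_AGENTS) -> str:
    """Return robots.txt content, adding one Disallow group per AI agent."""
    if not discourage_ai:
        return robots_txt  # option off: leave the output untouched

    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    # Blank lines separate groups so each agent gets its own record.
    return robots_txt.rstrip("\n") + "\n\n" + "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    base = "User-agent: *\nDisallow: /wp-admin/\n"
    print(append_ai_rules(base, discourage_ai=True))
```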

 ----

 There are typically 4 main approaches to discouraging AI services from
 crawling your site:

 1. You can add `robots.txt` entries matching the different User Agents
 used by AI services, asking them not to crawl the site via a
 `Disallow: /` rule.
     - **This seems to be the cleanest approach, and the one that AI
 services are the most likely to respect.**
     - This also has an important limitation: it relies on a list of AI
 User Agents that would have to be kept up to date. It would be hard for
 that list to ever be fully exhaustive. ''See an example of the
 user agents we would have to support below.''
 2. You can add an `ai.txt` file to your site, as
 [https://site.spawning.ai/spawning-ai-txt suggested by Spawning AI].
     - However, we have no guarantee AI services currently recognize and
 respect this file.
 3. You could add a meta tag to your site's `head`:
 `<meta name="robots" content="noai, noimageai" />`.
 This is something that was apparently first
 [https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371 implemented by DeviantArt].
     - I do not know if this is actually respected by AI services. It is
 not an HTML standard today. In fact, discussions for a new HTML standard
 are still in progress, and suggest a different tag
 ([https://github.com/whatwg/html/issues/9334 reference]).
     - If a standard like that were to be accepted, and if AI services
 agreed to use it, it may be the best implementation in the future since we
 would not have to maintain a list of AI services.
 4. You can completely block specific User Agents from accessing the site.
     - I believe we may not want to implement something that drastic in
 WordPress Core, since it could end up blocking real visitors. This is
 better left to plugins.
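
 To make the first approach concrete, the sketch below uses Python's
 standard `urllib.robotparser` to show how a client that honours
 `robots.txt` would read such rules. The rules text and the `is_allowed`
 helper are assumptions for illustration; whether a given AI crawler
 actually behaves this way is not something this ticket can confirm.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the agent names come from the examples in this ticket.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_allowed(user_agent: str, path: str = "/", rules: str = RULES) -> bool:
    """Check whether a robots.txt-compliant client may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Listed agents are refused; an ordinary browser agent is not affected.
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(agent))
```

 This only models voluntary compliance: nothing here stops a crawler from
 ignoring the file, which is the gap the fourth approach tries to close.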

 ----

 Some plugins already implement some of the approaches above, which
 suggests there may be interest in including such a feature in Core.

 - [https://wordpress.org/plugins/cellarweb-chatbot-blocker/ ChatBot
 Blocker]
 - [https://wordpress.org/plugins/simple-noai-and-noimageai/ Simple NoAI
 and NoImageAI]
 - [https://wordpress.org/plugins/block-ai-crawlers/ Block AI Crawlers]
 - [https://wordpress.org/plugins/block-chat-gpt-via-robots-txt/ Block Chat
 GPT via robots.txt]
 - [https://wordpress.org/plugins/block-common-crawl-via-robots-txt/ Block
 Common Crawl via robots.txt]
 - [https://github.com/thefrosty/wp-block-ai-scrapers WordPress Block AI
 Scrapers]

 ----

 If we were to go with the first option, here are some examples of the User
 Agents we would have to support:

 - `Amazonbot` -- https://developer.amazon.com/support/amazonbot
 - `anthropic-ai` -- https://www.anthropic.com/
 - `Bytespider` -- https://www.bytedance.com/
 - `CCBot` -- https://commoncrawl.org/ccbot
 - `ClaudeBot` -- https://claude.ai/
 - `cohere-ai` -- https://cohere.com/
 - `FacebookBot` -- https://developers.facebook.com/docs/sharing/bot
 - `Google-Extended` -- https://blog.google/technology/ai/an-update-on-web-publisher-controls/
 - `GPTBot` -- https://platform.openai.com/docs/gptbot
 - `omgili` -- https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
 - `omgilibot` -- https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
 - `SentiBot` -- https://sentione.com/
 - `sentibot` -- https://sentione.com/

 This list could be made filterable so folks can extend or modify it as
 they see fit.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60805>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform