[wp-trac] [WordPress Trac] #60805: Reading Settings: add option to discourage AI services from crawling the site
WordPress Trac
noreply at wordpress.org
Tue Mar 19 12:06:28 UTC 2024
#60805: Reading Settings: add option to discourage AI services from crawling the
site
-----------------------------+-----------------------------
 Reporter:  jeherve          |      Owner:  (none)
     Type:  feature request  |     Status:  new
 Priority:  normal           |  Milestone:  Awaiting Review
Component:  Privacy          |    Version:
 Severity:  normal           |   Keywords:
  Focuses:  privacy          |
-----------------------------+-----------------------------
I'd like to suggest a new addition to the bottom of the Reading Settings
screen in the dashboard:
[[Image(https://cldup.com/p6xw24IFff.png)]]
This new section would help site owners indicate whether or not they would
like their content to be indexed by AI services and used to train future
AI models.
There have been a lot of discussions about this in the past 2 years:
content creators and site owners have asked whether their work could and
should be used to train AI. Opinions vary, but at the end of the day I
believe most would agree that as a site owner, it would be nice if I could
choose for myself, for my own site.
In practice, I would imagine the feature to work just like the Search
Engines feature just above: when toggled, it would edit the site's
`robots.txt` file and disallow a specific list of AI services from
crawling the site.
----
There are typically 4 main approaches to discouraging AI Services from
crawling your site:
1. You can add `robots.txt` entries matching the different User Agents
used by AI services, asking them not to crawl the site via a
`Disallow: /` rule.
- **This seems to be the cleanest approach, and the one that AI
services are the most likely to respect.**
- This also has an important limitation: it relies on a list of AI
User Agents that would have to be kept up to date, and that list is
unlikely to ever be exhaustive. ''See an example of the user agents we
would have to support below.''
2. You can add an `ai.txt` file to your site, as
[https://site.spawning.ai/spawning-ai-txt suggested by Spawning AI].
- However, we have no guarantee AI services currently recognize and
respect this file.
3. You could add a meta tag to your site's `head`:
`<meta name="robots" content="noai, noimageai" />`. This was apparently first
[https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371 implemented by DeviantArt].
- I do not know if this is actually respected by AI services. It is
not an HTML standard today. In fact, discussions for a new HTML standard
are still in progress, and suggest a different tag
([https://github.com/whatwg/html/issues/9334 reference]).
- If a standard like that were to be accepted, and if AI Services
agreed to use it, it may be the best implementation in the future since we
would not have to define a list of AI services.
4. You can completely block specific User Agents from accessing the site.
- I believe we may not want to implement something that drastic in
WordPress Core, since it could end up blocking real visitors. This is
better left to plugins.
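
To make the first approach concrete, here is a minimal sketch of the rules it would produce. It is written in Python purely for illustration and is not a proposed Core implementation; the function name `build_ai_rules` and the three-agent subset are assumptions made for this example. It emits one `Disallow: /` group per user agent, then checks the result with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of the user agents named in this ticket.
AI_USER_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]


def build_ai_rules(user_agents):
    """Return robots.txt text with one 'Disallow: /' group per user agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"


rules = build_ai_rules(AI_USER_AGENTS)
print(rules)

# Check the generated rules with the standard-library robots.txt parser.
parser = RobotFileParser()
parser.parse(rules.splitlines())

# A listed AI crawler is asked to stay away from every path ...
print(parser.can_fetch("GPTBot", "https://example.com/a-post/"))
# ... while an agent that is not listed is unaffected.
print(parser.can_fetch("Mozilla/5.0", "https://example.com/a-post/"))
```

As with any `robots.txt` rule, this only expresses a request. It has an effect only if the crawler chooses to honor it.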
----
Some plugins already implement some of the approaches above, which
suggests there may be interest in including such a feature in Core.
- [https://wordpress.org/plugins/cellarweb-chatbot-blocker/ ChatBot Blocker]
- [https://wordpress.org/plugins/simple-noai-and-noimageai/ Simple NoAI and NoImageAI]
- [https://wordpress.org/plugins/block-ai-crawlers/ Block AI Crawlers]
- [https://wordpress.org/plugins/block-chat-gpt-via-robots-txt/ Block Chat GPT via robots.txt]
- [https://wordpress.org/plugins/block-common-crawl-via-robots-txt/ Block Common Crawl via robots.txt]
- [https://github.com/thefrosty/wp-block-ai-scrapers WordPress Block AI Scrapers]
----
If we were to go with the first option, here are some examples of the User
Agents we would have to support:
- `Amazonbot` -- https://developer.amazon.com/support/amazonbot
- `anthropic-ai` -- https://www.anthropic.com/
- `Bytespider` -- https://www.bytedance.com/
- `CCBot` -- https://commoncrawl.org/ccbot
- `ClaudeBot` -- https://claude.ai/
- `cohere-ai` -- https://cohere.com/
- `FacebookBot` -- https://developers.facebook.com/docs/sharing/bot
- `Google-Extended` -- https://blog.google/technology/ai/an-update-on-web-publisher-controls/
- `GPTBot` -- https://platform.openai.com/docs/gptbot
- `omgili` -- https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
- `omgilibot` -- https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
- `SentiBot` -- https://sentione.com/
- `sentibot` -- https://sentione.com/
This list could be made filterable so folks can extend or modify it as
they see fit.
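
As a rough illustration of what "filterable" could mean, the sketch below passes the default list through a chain of callbacks before it is used. It is written in Python for illustration only. The names `add_user_agent_filter` and `get_ai_user_agents`, and the agent `ExampleAIBot`, are hypothetical and not part of any existing API.

```python
from typing import Callable, List

# Default list copied from this ticket (SentiBot appears there in both casings).
DEFAULT_AI_USER_AGENTS = [
    "Amazonbot", "anthropic-ai", "Bytespider", "CCBot", "ClaudeBot",
    "cohere-ai", "FacebookBot", "Google-Extended", "GPTBot",
    "omgili", "omgilibot", "SentiBot", "sentibot",
]

# Hypothetical registry of callbacks; each takes a list and returns a list.
_filters: List[Callable[[List[str]], List[str]]] = []


def add_user_agent_filter(callback: Callable[[List[str]], List[str]]) -> None:
    """Register a callback that can extend or trim the user agent list."""
    _filters.append(callback)


def get_ai_user_agents() -> List[str]:
    """Return the default list after every registered callback has run."""
    agents = list(DEFAULT_AI_USER_AGENTS)  # copy so the defaults never mutate
    for callback in _filters:
        agents = callback(agents)
    return agents


# Example: a site owner adds one (made-up) agent and removes another.
add_user_agent_filter(lambda agents: agents + ["ExampleAIBot"])
add_user_agent_filter(lambda agents: [a for a in agents if a != "Amazonbot"])

print(get_ai_user_agents())
```

Callbacks run in registration order and each receives the previous one's output, so a later callback can undo or refine an earlier one.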
--
Ticket URL: <https://core.trac.wordpress.org/ticket/60805>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform