[wp-hackers] Blocking SEO robots

Jeremy Clarke jer at simianuprising.com
Wed Aug 6 12:31:29 UTC 2014


On Wednesday, August 6, 2014, David Anderson <david at wordshell.net> wrote:

> The issue's not about how to write blocklist rules; it's about having a
> reliable, maintained, categorised list of bots such that it's easy to
> automate the blocklist. Turning the list into .htaccess rules is the easy
> bit; what I want to avoid is having to spend long churning through log
> files to obtain the source data, because it feels very much like something
> there 'ought' to be pre-existing data out there for, given how many watts
> the world's servers must be wasting on such bots.


The best answer is the series of .htaccess-based blacklists from
PerishablePress. I think this is the latest one:

http://perishablepress.com/5g-blacklist-2013/

He uses a mix of blocked user agents, blocked IPs and blocked requests
(e.g. /admin.php and similar intrusion scans aimed at other software).
He's been updating it for years and it's definitely a WP-centric project.
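To give a flavour of the approach (this is not the actual 5G rule set,
just a simplified sketch of the same technique, with made-up bot names
and paths):

  <IfModule mod_rewrite.c>
    RewriteEngine On
    # Refuse requests from a few placeholder user agents...
    RewriteCond %{HTTP_USER_AGENT} (evilbot|scrapybot|junkcrawler) [NC,OR]
    # ...and requests probing for admin scripts this site doesn't run.
    RewriteCond %{REQUEST_URI} (/admin\.php|/phpmyadmin) [NC]
    # No substitution, just answer 403 Forbidden.
    RewriteRule .* - [F,L]
  </IfModule>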

In the past his lists have blocked some legitimate traffic (the Facebook
spider, because it sent an empty user agent, and common crawlers used by
academics), but that's bound to happen, and I'm sure every UA has been
abused by a spammer at some point.
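Those false positives come from rules roughly like this one, which
forbids any request that sends no User-Agent header at all (again a
sketch, not his exact rule):

  RewriteCond %{HTTP_USER_AGENT} ^$
  RewriteRule .* - [F,L]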

I run a ton of sites on my server, so I hate the .htaccess format (which
is a pain to implement alongside WP Super Cache rules). If I used
multisite it would be less of a big deal. Either way, know that you can
block UAs for all virtual hosts at the server level if that's relevant.
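For a server-wide block, something along these lines in the main Apache
config (outside any <VirtualHost>) applies to every site on the box; the
UA pattern and the Apache 2.2-style access directives are just an
illustration:

  # Tag requests whose User-Agent matches a placeholder pattern...
  SetEnvIfNoCase User-Agent "evilbot|junkcrawler" block_ua
  <Location />
    Order Allow,Deny
    Allow from all
    # ...and refuse anything that got tagged.
    Deny from env=block_ua
  </Location>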

Note that IP blocking is a lot more effective below the Apache level:
blocking in Apache still uses a fair amount of resources per request
(though at least no PHP or MySQL is involved). On Linux an iptables-based
block is much cheaper.
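A minimal sketch of what that looks like (the addresses are placeholders
from the TEST-NET documentation ranges, not real offenders):

  # Drop all traffic from one bad IP before Apache ever sees it
  iptables -A INPUT -s 203.0.113.45 -j DROP
  # Or drop an entire range
  iptables -A INPUT -s 198.51.100.0/24 -j DROP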




-- 
Jeremy Clarke
Code and Design • globalvoicesonline.org

