[wp-edu] Uploads folder content indexed in Google?

Ben Bakelaar bakelaar at rutgers.edu
Wed Sep 16 20:04:17 UTC 2015


Yes, that was it! Thanks Richard.



Created robots.txt file:

User-agent: *
Disallow: /wp-content/uploads/sites/7/
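
For anyone who wants to sanity-check a rule like this offline before relying on Fetch as Googlebot, here is a minimal sketch using Python's standard-library robots.txt parser. The file names under sites/7 and sites/8 and the is_allowed helper are made up for illustration, and Python's parser is not guaranteed to match Googlebot's behavior in every edge case.

```python
from urllib.robotparser import RobotFileParser

# The rule from this message; User-agent and Disallow sit on adjacent lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/uploads/sites/7/
"""

def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

BASE = "http://eclipse.rutgers.edu"
# Hypothetical file paths, used only to exercise the rule.
in_blocked_dir = BASE + "/wp-content/uploads/sites/7/2015/09/example.doc"
in_other_dir = BASE + "/wp-content/uploads/sites/8/2015/09/example.doc"

print(is_allowed(ROBOTS_TXT, in_blocked_dir))
print(is_allowed(ROBOTS_TXT, in_other_dir))
```

Note that this only says what a well-behaved crawler should skip; robots.txt does not stop anyone from opening the files directly.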



Tested with Fetch as Googlebot. Went to Google Index > Remove URLs within
Google Webmaster Tools. Clicked on “Create a new removal request”. Entered
directory name only:

/wp-content/uploads/sites/7
<http://eclipse.rutgers.edu/wp-content/uploads/sites/7>



On the next screen, the third option on the drop-down menu is “Remove
directory” which normally is not there if you enter a full URL. Submitted
and pending!



I also turned off “Indexes” in httpd.conf; I'm not sure whether it was on
for all sub-sites before. My working theory is that directory listings were
turned on, search bots somehow reached those listing pages, and they then
indexed each file and sub-directory listed.
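
A rough way to check that theory against saved copies of the upload directory pages is to look for the "Index of /..." title that Apache's automatic listings normally carry. This is only a heuristic sketch: the function name and sample pages are invented here, and a customized listing template would not match.

```python
import re

# Apache-generated listings usually have a <title> beginning "Index of /".
_LISTING_TITLE = re.compile(r"<title>\s*Index of\s+/", re.IGNORECASE)

def looks_like_directory_listing(html: str) -> bool:
    """Heuristic: True if the page title looks like an auto-generated listing."""
    return bool(_LISTING_TITLE.search(html))

# Made-up sample pages for illustration.
listing_page = (
    "<html><head><title>Index of /wp-content/uploads/sites/7</title></head>"
    "<body><h1>Index of /wp-content/uploads/sites/7</h1></body></html>"
)
normal_page = "<html><head><title>Course syllabus</title></head><body></body></html>"

print(looks_like_directory_listing(listing_page))
print(looks_like_directory_listing(normal_page))
```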



I guess my outdated conception was that search engine bots only scour the
web via finding extant links within HTML documents – so if there is no
public link to that content, it will never get indexed. It appears they may
be more aggressive now, using algorithms to predict sub-directories
(perhaps based on CMS detection?) and then scan for available content? Just
a working theory since I still can’t explain it.



---------------------------------
BEN BAKELAAR | IT Services
School of Communication and Information

Rutgers, The State University of New Jersey
p 848.932.8710







*From:* wp-edu [mailto:wp-edu-bounces at lists.automattic.com] *On Behalf
Of* Berardi, Richard
*Sent:* Tuesday, September 15, 2015 9:22 PM
*To:* Low-traffic list discussing WordPress in education.
*Cc:* jon.oliver at rutgers.edu
*Subject:* Re: [wp-edu] Uploads folder content indexed in Google?



*Removing an entire directory or site*

In order for a directory or site-wide removal to be successful, the
directory or site must be *disallowed in the site's robots.txt file
<http://www.google.com/support/webmasters/bin/answer.py?answer=35302>*. For
example, in order to remove the http://www.example.com/secret/ directory,
your robots.txt file would need to include:
   User-agent: *
   Disallow: /secret/
It isn't enough for the root of the directory to return a 404 status code,
because it's possible for a directory to return a 404 but still serve out
files underneath it. Using robots.txt to block a directory (or an entire
site) ensures that all the URLs under that directory (or site) are blocked
as well. You can test whether a directory has been blocked correctly using
either the Fetch as Googlebot
<http://www.google.com/support/webmasters/bin/answer.py?answer=158587> or
Test robots.txt
<http://www.google.com/support/webmasters/bin/answer.py?answer=156449>
features in Webmaster Tools.

Only verified owners of a site can request removal of an entire site or
directory in Webmaster Tools. To request removal of a directory or site,
click on the site in question, then go to *Site configuration > Crawler
access > Remove URL*. If you enter the root of your site as the URL you
want to remove, you'll be asked to confirm that you want to remove the
entire site. If you enter a subdirectory, select the "Remove directory"
option from the drop-down menu.



http://googlewebmastercentral.blogspot.com/2010/03/url-removal-explained-part-i-urls.html?m=1

Hope this helps.



On Sep 15, 2015, at 5:59 PM, Ben Bakelaar <bakelaar at rutgers.edu> wrote:

Hello all, it appears we have had some of the files on our WordPress
network indexed in Google search results. I had assumed security through
obscurity here, but it appears I was wrong.



Our network runs sites as sub-directories, and we also use domain mapping
for some of them. I haven’t quite figured out how yet, but one of the
mapped domains (xyz, not root.url.com) which points to site A has shown up
in search results with absolute paths to files in a completely different
site B (which is actually a sub-dir site, not masked). And they load just
fine – this must be an unanticipated quirk of DNS records + the WordPress
code that routes requests.



So we have URLs like
xyz.domain/wp-content/uploads/sites/x/xxxx/xx/filename.doc coming up in
results! Eek! I have already started the removal requests via Google
Webmaster Tools. Again no explanation yet for how these URLs were located
by the search engines, but I’m working on possible theories.



Aside from getting to the bottom of this, I’m trying to figure out the best
way to block this from happening in the future. Apache .htaccess rules are
one option. Robots.txt could be another? Has anyone run into this issue
before, and what have you done as a solution? I’m a little surprised this
isn’t addressed “in code”. There are many plugins that allow uploads, this
is a desired and supported user behavior by default. But there are no
conceivable use cases I can think of where those uploads should be able to
be indexed by bots.



Could I simply place robots.txt in the root of the WP codebase, and tell it
to avoid indexing ALL files under /wp-content? Would that cover all the
various access cases with direct-linked files (like graphics), domain
masking/mapping, etc.? And to fully prevent opening any uploads from
outside the university network (as a decent but arbitrary perimeter), can I
do the same with .htaccess or do I have to make dozens of .htaccess files
per /wp-content/uploads/sites/X – in each little sub-directory?







