Black List for Robots
Some robots (web crawlers, spiders) are badly behaved and make many page accesses in a short time.
Fortunately there is a method to control this. It consists of including a robots.txt
file in the root of the web site. Every robot should (and most do)
retrieve this file ("/robots.txt") before trying to index any part
of a site. Some information can be found at
http://www.robotstxt.org/wc/robots.html
and
Wikipedia:Robots.txt
The file must contain a list of entries separated by blank lines. Each entry contains a set of Parameter: Value lines. The first line of an entry is
User-agent: name of agent
and the others are
Disallow: path
An example of a file that excludes all robots completely is:
User-agent: *
Disallow: /
An example for TWiki (2003-Feb release) that disallows everything except view (also attached):
User-agent: *
Disallow: /bin/attach
Disallow: /bin/changes
Disallow: /bin/edit
Disallow: /bin/geturl
Disallow: /bin/installpasswd
Disallow: /bin/mailnotify
Disallow: /bin/manage
Disallow: /bin/oops
Disallow: /bin/passwd
Disallow: /bin/preview
Disallow: /bin/rdiff
Disallow: /bin/register
Disallow: /bin/rename
Disallow: /bin/save
Disallow: /bin/savemulti
Disallow: /bin/search
Disallow: /bin/setlib.cfg
Disallow: /bin/statistics
Disallow: /bin/testenv
Disallow: /bin/upload
Disallow: /bin/viewauth
Disallow: /bin/viewfile
Disallow: /list/
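The effect of such a file can be checked offline before deploying it. The following is a minimal sketch, assuming Python's standard urllib.robotparser module and a shortened copy of the rules above; the robot name "ExampleBot" and the topic path Main/WebHome are made up for illustration. It shows that Disallow values are matched as path prefixes, so /bin/view stays open while /bin/viewauth and /bin/edit are closed.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the TWiki rules above (only a few Disallow lines).
RULES = """\
User-agent: *
Disallow: /bin/edit
Disallow: /bin/search
Disallow: /bin/viewauth
Disallow: /bin/viewfile
Disallow: /list/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything from the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)

# "ExampleBot" is a hypothetical robot name; the paths are illustrative.
view_allowed = parser.can_fetch("ExampleBot", "/bin/view/Main/WebHome")
edit_allowed = parser.can_fetch("ExampleBot", "/bin/edit/Main/WebHome")
viewauth_allowed = parser.can_fetch("ExampleBot", "/bin/viewauth/Main/WebHome")

print("view:", view_allowed)
print("edit:", edit_allowed)
print("viewauth:", viewauth_allowed)
```

Note that this only tells you how a robot that follows the file would behave; it does nothing against robots that ignore robots.txt.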
Without this file, I was getting a lot of Edit failures in the
data/log*.txt files for various pages I had never tried to edit manually, but not for all pages on the site. I suspect that crawlers give up on a site after they encounter a certain number of errors. If so, this file may help public and intranet TWiki sites get properly indexed by search engines.
There is also a meta tag to stop robots, which TWiki uses: the
<META NAME="ROBOTS" CONTENT="NOINDEX"> tag is included for all but the view of the latest topic revision. (Note: this means that if you
want search engines to index old revisions, you need to remove this line from the templates in your default skin.)
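To see which rendered pages actually carry the tag, the HTML can be inspected with a small script. This is only a sketch, assuming Python's standard html.parser module; the class name RobotsMetaFinder and the function robots_directives are hypothetical names chosen for this example, and the sample pages are made up.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.append(part.strip().lower())


def robots_directives(html):
    """Return the list of robots directives found anywhere in the page."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    finder.close()
    return finder.directives


# Made-up sample pages for illustration.
old_revision = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body>old</body></html>'
latest_view = "<html><head><title>WebHome</title></head><body>latest</body></html>"

print(robots_directives(old_revision))
print(robots_directives(latest_view))
```

The script reports the tag wherever it appears in the page; whether a search engine honours a tag outside the head section is a separate question.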
Q: If I create a page with a search on it, can I add some kind of a "no robots" tag in the "text" of that page to keep it from being indexed, or must I put that tag at the very top of the HTML for that page, which means it must be part of the template? (And, if I can put a tag in the text, what should it be?)
A: Put this in the body of the page somewhere:
<pre><meta name="robots" content="noindex" /></pre>
Note: This is unconfirmed. Search engines may or may not honour it. The same technique does work for other things like embedding per-page stylesheets or JavaScript.
Some robots (which might even be aware of the /robots.txt file) may hit a site at a high rate. One possible solution is to measure the access rate and automatically block an IP address if it exceeds a certain average rate, let's say more than 2 hits per second. See
BlackListPlugin.
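The rate-based blocking idea can be sketched as follows. This is not the BlackListPlugin implementation, just a minimal illustration in Python under stated assumptions: the class name RateBlocker, the 10-second window, and the limit of 20 hits (an average of 2 hits per second over that window) are all choices made for this example. Timestamps are passed in explicitly so the logic can be tested without waiting.

```python
from collections import defaultdict, deque


class RateBlocker:
    """Blocks an IP once it exceeds max_hits within a sliding time window."""

    def __init__(self, max_hits=20, window_seconds=10.0):
        # 20 hits per 10 seconds is an average of 2 hits per second (assumed values).
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent hits
        self.blocked = set()

    def allow(self, ip, now):
        """Record a hit at time `now` (seconds); return False if the IP is blocked."""
        if ip in self.blocked:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Forget hits that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.blocked.add(ip)
            return False
        return True


blocker = RateBlocker()

# A fast robot: 21 hits, 0.1 seconds apart (addresses are documentation examples).
fast_results = [blocker.allow("192.0.2.1", i * 0.1) for i in range(21)]

# A slow visitor: 30 hits, 1 second apart.
slow_results = [blocker.allow("192.0.2.2", float(i)) for i in range(30)]

print("fast robot blocked:", not fast_results[-1])
print("slow visitor blocked:", not all(slow_results))
```

Averaging over a window rather than reacting to any two hits in the same second avoids blocking a browser that loads a page together with its attachments in a short burst.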
Not really related to TWiki, but keep in mind that when you use
robots.txt you are effectively telling malevolent web crawlers
where to look for secret stuff.
PeterThoeny,
ChristopheVermeulen,
MathewBoorman,
RichardDonkin,
RandyKramer,
MattWilkie,
WillNorris,
WalterMundt,
Committed. Still needs to be documented (where is appropriate?).
--
WalterMundt
Sounds good. Should go into the
TWikiRoot that already has sample .htaccess files. Do not forget to update the docs.
TWikiInstallationGuide looks like a logical place to me.
--
PeterThoeny - 20,30 Apr 2004
I bump this from Cairo to
DakarRelease since doc is pending.
--
PeterThoeny - 16 Aug 2004