Black List for Robots

Some robots (web crawlers, spiders) are badly behaved and make many page accesses in a short time. Fortunately, there is a way to control this: place a robots.txt file in the root of the web site. Every robot should (and most do) retrieve this file ("/robots.txt") before trying to index any part of a site. More information can be found at http://www.robotstxt.org/wc/robots.html and Wikipedia:Robots.txt.

The file must contain a list of entries separated by blank lines. Each entry is a set of "Parameter: Value" lines. The first line of an entry is "User-agent: name of agent" and the others are "Disallow: path".

An example of a file that excludes all robots completely:

User-agent: *
Disallow: /
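
The entry format described above can be sketched as a tiny parser. This is only an illustration of the file layout (blank-line separated entries, User-agent first, then Disallow lines), not how any particular robot reads the file; the function name parse_robots is made up for this example.

```python
def parse_robots(text):
    """Split robots.txt text into entries: a list of (agents, disallowed_paths)."""
    entries = []
    agents, paths = [], []
    for raw in text.splitlines():
        if raw.strip().startswith("#"):
            continue                          # comment-only line, ignore it
        line = raw.split("#", 1)[0].strip()   # drop trailing comments and whitespace
        if not line:
            # A blank line closes the current entry, if there is one.
            if agents:
                entries.append((agents, paths))
            agents, paths = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow" and agents:
            paths.append(value)
    if agents:
        entries.append((agents, paths))       # the last entry has no trailing blank line
    return entries


EXCLUDE_ALL = "User-agent: *\nDisallow: /\n"
print(parse_robots(EXCLUDE_ALL))              # one entry: agent "*", path "/"
```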

An example for TWiki (2003-Feb release) that disallows everything except view (also attached):

User-agent: *
Disallow: /bin/attach
Disallow: /bin/changes
Disallow: /bin/edit
Disallow: /bin/geturl
Disallow: /bin/installpasswd
Disallow: /bin/mailnotify
Disallow: /bin/manage
Disallow: /bin/oops
Disallow: /bin/passwd
Disallow: /bin/preview
Disallow: /bin/rdiff
Disallow: /bin/register
Disallow: /bin/rename
Disallow: /bin/save
Disallow: /bin/savemulti
Disallow: /bin/search
Disallow: /bin/setlib.cfg
Disallow: /bin/statistics
Disallow: /bin/testenv
Disallow: /bin/upload
Disallow: /bin/viewauth
Disallow: /bin/viewfile
Disallow: /list/
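
To check what a well-behaved robot would make of these rules, one option is Python's standard urllib.robotparser module. The sketch below uses only a few of the Disallow lines above, and the agent name "ExampleBot" and the topic path Main/WebHome are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above; the full list works the same way.
RULES = """\
User-agent: *
Disallow: /bin/edit
Disallow: /bin/search
Disallow: /bin/viewauth
Disallow: /bin/viewfile
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the given agent may fetch the path under RULES."""
    return parser.can_fetch(agent, path)


print(allowed("/bin/view/Main/WebHome"))   # True: /bin/view is not listed
print(allowed("/bin/edit/Main/WebHome"))   # False: matches Disallow: /bin/edit
```

Disallow values are matched as path prefixes. That is why /bin/viewauth and /bin/viewfile are listed separately: a single "Disallow: /bin/view" would also block the view script itself.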

Without this file, I was getting a lot of Edit failures in the data/log*.txt files for pages I had never tried to edit manually, but not for every page on the site. I suspect that crawlers give up on a site after they experience a certain number of errors. If so, this file may help public and intranet TWiki sites get properly indexed by search engines.

There is also a meta tag to stop robots, which TWiki uses: the <META NAME="ROBOTS" CONTENT="NOINDEX"> tag is emitted for everything except the view of the latest topic revision. (Note: this means that if you want search engines to index old revisions, you need to remove this line from the templates in your default skin.)
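
A rough way to check whether a rendered page carries this tag is to scan its HTML with Python's standard html.parser module. This is a sketch of the idea only; real search engines have their own rules for reading the tag, and the names RobotsMetaFinder and is_noindex are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> matches.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


def is_noindex(html):
    """Return True if the page asks robots not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body></body></html>'
print(is_noindex(page))   # True
```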

Q: If I create a page with a search on it, can I add some kind of a "no robots" tag in the "text" of that page to keep it from being indexed, or must I put that tag at the very top of the HTML for that page, which means it must be part of the template? (And, if I can put a tag in the text, what should it be?)

A: Put this in the body of the page somewhere:

<pre><meta name="robots" content="noindex" /></pre>
Note: This is unconfirmed. Search engines may or may not honour it. The same technique does work for other things, such as embedding per-page stylesheets or JavaScript.


Some robots (even ones that are aware of the /robots.txt file) can hit a site at a high rate. One possible solution is to measure the access rate and automatically block an IP address when it exceeds a certain average rate, say more than 2 hits per second. See BlackListPlugin.
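
The rate-based blocking idea can be sketched as follows. This is not how BlackListPlugin is implemented; it is a minimal illustration that assumes the average is taken over a 10-second sliding window, and the class name RateBlocker and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque


class RateBlocker:
    """Block an IP once its average hit rate over a sliding window is too high."""

    def __init__(self, max_rate=2.0, window=10.0):
        self.window = window                  # seconds over which the rate is averaged
        self.max_hits = max_rate * window     # hits tolerated within one window
        self.hits = defaultdict(deque)        # ip -> timestamps of recent hits
        self.blocked = set()

    def allow(self, ip, now=None):
        """Record one hit from ip and return False if the request should be refused."""
        if now is None:
            now = time.monotonic()
        if ip in self.blocked:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Forget hits that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.blocked.add(ip)
            del self.hits[ip]
            return False
        return True


blocker = RateBlocker(max_rate=2.0, window=10.0)
# Simulate 10 hits per second from one (made-up) address.
results = [blocker.allow("192.0.2.1", now=i * 0.1) for i in range(30)]
print(results.count(True), results.count(False))   # 20 allowed, then 10 refused
```

Averaging over a window rather than reacting to two hits within a single second leaves room for short bursts, such as a browser fetching a page together with its attachments. In this sketch a blocked address stays blocked until it is removed from the set by hand.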


Not really related to TWiki, but keep in mind that when you use robots.txt you are effectively telling malevolent web crawlers where to look for secret stuff.


PeterThoeny, ChristopheVermeulen, MathewBoorman, RichardDonkin, RandyKramer, MattWilkie, WillNorris, WalterMundt

Committed. Still needs to be documented (where is appropriate?).

-- WalterMundt

Sounds good. It should go into the TWikiRoot, which already has sample .htaccess files. Do not forget to update the docs; TWikiInstallationGuide looks like a logical place to me.

-- PeterThoeny - 20,30 Apr 2004

I am bumping this from Cairo to DakarRelease since the documentation is pending.

-- PeterThoeny - 16 Aug 2004

Topic attachments
robots.txt (r1, 0.5 K, 2003-06-02 20:40, UnknownUser): updated to feb1 2003 release.
Topic revision: r25 - 2005-06-05 - WillNorris
 