How to run TWiki reliably on a public website

2015-10-16 - 02:34:07 by PeterThoeny in Deployment

TWiki is typically used in an access restricted environment, such as behind a corporate firewall. There are many publicly accessible TWiki sites as well, such as the NIST Cloud Computing Collaboration Site

, or this TWiki here on TWiki.org. If you run TWiki on a server accessible by the public you need to be address the following issues:

1. Unpredictable contributions - how to nurture contributions, how to moderate content to retain quality.
2. Unpredictable server load - how to curb rogue users and spiders.

This blog post addresses the unpredictable server load issue. I describe what we did to keep TWiki.org's TWiki operate smoothly. As for scale, our TWiki currently has 56,000 unique visitors/month, 68,000 registered users, and 140,000 wiki pages.

We have sometimes rogue spiders accessing wiki pages at a high rate. This can at times impose a high CPU load since TWiki is serving pages dynamically. This makes the site slow for everybody else. Here is what we did to address the issue:

Use robots.txt to tell spiders how to index.
Use caching to deliver pages without the TWiki engine.
Only authenticated users can access page meta content, such as view previous page revisions, or see the difference between revisions, etc.
Deny rogue spiders by using the "deny" directive in the Apache configuration.

1. Use robots.txt to tell spiders how to index

We can tell spiders that play nice how to index TWiki. There are three parts:

Delay of crawling
Exclude content
Disallow user agents

To tell spiders what to do place a robots.txt file in the HTML root directory. Spiders that play nicely will check this file for directions.

To indicate the crawl delay specify this:

User-agent: *
Crawl-delay: 60

This tells the spider to crawl pages not faster than at 60 second intervals.

To exclude content, or to exclude TWiki scripts in our case, specify this:

Disallow: /cgi-bin/attach
Disallow: /cgi-bin/changes
Disallow: /cgi-bin/edit
Disallow: /cgi-bin/geturl
Disallow: /cgi-bin/installpasswd
Disallow: /cgi-bin/login
Disallow: /cgi-bin/logon
Disallow: /cgi-bin/mailnotify
Disallow: /cgi-bin/manage
Disallow: /cgi-bin/oops
Disallow: /cgi-bin/passwd
Disallow: /cgi-bin/preview
Disallow: /cgi-bin/rdiff
Disallow: /cgi-bin/register
Disallow: /cgi-bin/rename
Disallow: /cgi-bin/rest
Disallow: /cgi-bin/mdrepo
Disallow: /cgi-bin/save
Disallow: /cgi-bin/search
Disallow: /cgi-bin/statistics
Disallow: /cgi-bin/upload
Disallow: /cgi-bin/viewauth
Disallow: /cgi-bin/viewfile
Disallow: /cgi-bin/view/Sandbox
Disallow: /cgi-bin/view/*?

As you can see, all scripts but the view script are disallowed. There is a view script with a /*?, This tells the spider to ignore all pages that have a URL parameter, such as to retrieve a previous page revision.

To disallow spiders specify this:

User-agent: iecrawler
User-agent: ie_crawler
User-agent: e-SocietyRobot
User-agent: SputnikBot
User-agent: SputnikImageBot
Disallow: /

For details see TWiki.org's robots.txt file at http://twiki.org/robots.txt

What about spiders that ignore the robots.txt file? Stay tuned.

2. Use caching to deliver pages without the TWiki engine

We installed the TWikiGuestCacheAddOn to cache content. Content is only cached for non-authenticated users. This is what we want - regular users should see the latest content (at the cost of some speed due to dynamic page rendering). Non-authenticated users are served cached content. There are three tiers, some pages need to be refreshed more often than others. Here is the configuration of the add-on as seen in the configure script:

{TWikiGuestCacheAddOn}{CacheAge} = '1460';
{TWikiGuestCacheAddOn}{Debug} = '0';
{TWikiGuestCacheAddOn}{Tier1CacheAge} = '1';
{TWikiGuestCacheAddOn}{Tier1Topics} = 'AskedAndAssignedQuestions, FastReport, TWikiFeatureProposals, WebAtom, WebChanges, WebChangesForAllWebs, WebRss';
{TWikiGuestCacheAddOn}{Tier2CacheAge} = '24';
{TWikiGuestCacheAddOn}{Tier2Topics} = 'WebHome, WebTopicList, TWikiConsultants';
{TWikiGuestCacheAddOn}{ExcludeTopics} = 'TWikiRegistration, BlogEntry201208x2';

The cache of tier 1 pages expires in 1 hour, tier 2 expires in 24 hours, and the rest expires in 60 days. 60 days looks like a long time, but there are many pages that do not change that often, and if a page is updated, its cache is invalidated anyway.

With this, spiders and other non-authenticated users get a cached page delivered in 0.1 seconds instead of 1.5 seconds. The cache hit rate with this configuration is 10:1.

In other words, pages are served much faster in average, and the CPU load is substantially reduced.

3. Only authenticated users can access page meta content

You can look at a TWiki page in many different ways, and with that, there are many URLs for a particular TWiki page. Authenticated users should be able to do everything, but we want to curb spiders and other non-authenticated users. So, if a non-authenticated users accesses the rdiff script for example, we ask for login. That way we can save CPU cycles since spiders don't authenticate.

On TWiki.org we use template login (in configure: {LoginManager} = 'TWiki::LoginManager::TemplateLogin'). With that we can specify which scripts require authentication. Again in configure:

{AuthScripts} = 'attach,edit,manage,rename,save,upload,viewauth,rdiff,rdiffauth,rest,mdrepo';

As you can see all relevant scripts except the view script and reset password script require authentication.

Now we want to lock down the view script with URL parameters. We want to allow certain URL parameters, but not the bulk of others. For example, we want non-authenticated users be able to see the slides in presentation mode. We can do that with an Apache rewrite rule in /etc/httpd/conf.d/twiki.conf:

<Directory "/var/www/twiki/bin">
    RewriteEngine On
    RewriteCond %{QUERY_STRING}  !^$
    RewriteCond %{QUERY_STRING}  !^(slideshow=|note=|search=|skin=text|skin=plain|tag=|dir=|ip=|TWikiGuestCache=)
    RewriteRule view/(.*) /cgi-bin/viewauth/$1

    # other directives...
</Directory>

So, if there is a URL parameter, but not named slideshow, note, search, skin=text, skin=plain, tag, dir, ip, TWikiGuestCache we redirect from the view script to the viewauth script. The viewauth script asks for authentication in case the user is not authenticated, otherwise delivers the page the same way view does.

4. Deny rogue spiders

Some spiders just don't play nice. We can address this with some Apache directives in /etc/httpd/conf.d/twiki.conf:

BrowserMatchNoCase 80legs blockAccess
BrowserMatchNoCase ^Accoona blockAccess
BrowserMatchNoCase ActiveAgent blockAccess
BrowserMatchNoCase AhrefsBot blockAccess
# etc...
BrowserMatchNoCase YodaoBot blockAccess
BrowserMatchNoCase ZanranCrawler blockAccess
BrowserMatchNoCase zerbybot blockAccess
BrowserMatchNoCase ZIBB blockAccess
BrowserMatchNoCase ^$ blockAccess

<Directory "/var/www/twiki/bin">
    RewriteEngine On
    RewriteCond %{QUERY_STRING}  !^$
    RewriteCond %{QUERY_STRING}  !^(slideshow=|note=|search=|skin=text|skin=plain|tag=|dir=|ip=|TWikiGuestCache=)
    RewriteRule view/(.*) /cgi-bin/viewauth/$1

    AllowOverride None
    Order Allow,Deny
    Allow from all
    Deny from env=blockAccess 
    Deny from 120.36.149.0/24 195.228.43.194 104.131.208.52
</Directory>

With the BrowserMatchNoCase <name> blockAccess directives we list all spiders that are not welcome. In the <Directory> section we deny those spiders by specifying a Deny from env=blockAccess.

In addition, we can manually add rogue IP addresses by specifying a Deny from <list of IP addresses>.

I hope you can use these tips to make your TWiki site run smoothly. Let us know what issues you face and how you secure your TWiki server in the comments below.

Comments

Topic revision: r4 - 2015-11-16 - PeterThoeny

Account
- Log In
- Register User

Edit
Attach

Ideas, requests, problems regarding TWiki? Send feedback. Ask community in the support forum.
Copyright © 1999-2026 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.