Feature Proposal: Search Engines Should Index Only Plain View
Motivation
Tell search engines to index only the plain topic, not any specialized view of a topic. This reduces the clutter when googling TWiki.org and other public TWikis.
Description
Google indexes too much of TWiki.org and other public TWikis. If you search with the omitted results included, you get many hits for a single topic, e.g.
http://www.google.com/search?q=blacklistplugin+plugins+site:twiki.org&hl=en&lr=&ie=UTF-8&filter=0
returns
BlackListPlugin, but also variants with parameters
?_foo=1.7,
?sortcol=4&table=4&up=0,
?skin=print.pattern,
?raw=on, etc.
There is also the problem highlighted in
RawParamLeaksEmailAddresses that "raw" views are also indexed and cached, and in these views email addresses are returned unobscured. It is possible to craft a Google search for raw user topics and view them in Google's cache.
Only the plain topic should get indexed, e.g. each TWiki topic should be indexed only once, the one without any parameters.
Impact and Available Solutions
Current spec:
The view script already adds a
<meta name="robots" content="noindex" /> tag when you look at an older topic revision. Technically speaking, the skin carries the tag by default, and the view script removes it if you are looking at the latest revision.
Proposed new spec:
Do not remove the noindex tag if the view script is called with any URL parameters. This has the desired effect that only the plain topic gets indexed.
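The proposed rule could be sketched like this (a minimal illustration in Python, not the actual TWiki Perl code; the function and constant names are invented):

```python
# Sketch of the proposed spec: keep the noindex meta tag whenever the
# view URL carries any parameters, so only the plain topic is indexed.
from urllib.parse import urlsplit

NOINDEX_TAG = '<meta name="robots" content="noindex" />'

def robots_meta(url: str) -> str:
    """Return the noindex tag for any view URL that has parameters,
    and an empty string for the plain topic URL."""
    return NOINDEX_TAG if urlsplit(url).query else ""
```

So `robots_meta(".../view/Codev/SomeTopic")` yields nothing (the page is indexable), while any parameterised variant such as `?raw=on` or `?sortcol=4` keeps the noindex tag.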
In addition, we can make it easier for the search engines by telling them which links
not to follow, e.g. we can add a rel="nofollow" attribute to the anchor tags of links such as printable, older revs, table sort, view raw, etc. (The
BlackListPlugin does that for external links.)
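For illustration, the generated links might then look like this (hypothetical topic and URLs):

```html
<!-- Plain topic link: indexed and followed -->
<a href="/cgi-bin/view/Codev/BlackListPlugin">BlackListPlugin</a>
<!-- Parameterised variants: marked so robots do not follow them -->
<a rel="nofollow" href="/cgi-bin/view/Codev/BlackListPlugin?raw=on">Raw</a>
<a rel="nofollow" href="/cgi-bin/rdiff/Codev/BlackListPlugin">Diffs</a>
```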
--
PeterThoeny - 23 Jan 2005
Documentation
Examples
logout is a search term that provides enough hits (but not too many) to make it a reasonable test case. Compare:
Implementation
Discussion:
Just to be clear, when you say "plain" you mean the default skin, what any first-time visitor would see coming to the site, not the "plain skin", correct?
I agree with the proposal.
--
MattWilkie - 23 Jan 2005
Yes, plain as in "URL with no parameters", e.g.
https://twiki.org/cgi-bin/view/Codev/SearchEngineIndexOnlyPlainView for this topic.
--
PeterThoeny - 23 Jan 2005
Note that there are also entries in the Google index for actions other than view:
viewauth,
attach,
oops, and
rdiff
Perhaps a solution lies in a
robots.txt which returns only plain view URLs for each topic (similar to
WebTopicList), the idea being: rather than deny access, tell the bots what is allowed.
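A sketch of what such an allowlist-style robots.txt could look like (hypothetical paths; note that Allow is a widely supported extension, honoured by Google, but not part of the original 1994 robots.txt standard, so some crawlers may ignore it):

```
# Deny all CGI scripts by default...
User-agent: *
Disallow: /cgi-bin/
# ...but let robots fetch plain view pages.
Allow: /cgi-bin/view/
```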
--
WillNorris - 23 Jan 2005
I've changed the priority to 5 because of Google caching raw views (see the paragraph added to the Description above).
We could also put
rel="nofollow" on links with parameters to stop Google even retrieving the page, to save bandwidth. (Or does Google only request the header, and if it has
<meta name="robots" content="noindex" /> it doesn't retrieve the rest of the page? In that case adding nofollow wouldn't lead to much of a bandwidth saving, and would increase page processing unnecessarily.)
--
SamHasler - 25 Jan 2005
Ah, I see using
rel="nofollow" has already been suggested. However, it could lead to little saving if I'm right about headers.
SpeedUpTipsForTWiki20040901
point 2 highlights the fact that followed links to searches with parameters will eat up CPU resources.
OK, they might not get indexed if they have
<meta name="robots" content="noindex" />
in them, but the server has still had to generate the page.
So it would be a good idea to add
rel="nofollow" to any internal links with parameters not just to stop indexing, but to save CPU resources.
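As a sketch of that idea (an invented helper, not TWiki's actual rendering code), a post-processing filter could mark every internal link whose href carries URL parameters:

```python
# Sketch: add rel="nofollow" to anchor tags whose href has URL
# parameters, so robots never even request those CPU-expensive pages.
import re

def add_nofollow(html: str) -> str:
    """Mark parameterised links as nofollow; leave plain links alone."""
    def mark(match: re.Match) -> str:
        tag = match.group(0)
        href = match.group(1)
        if "?" in href and "rel=" not in tag:
            # Re-open the tag with the extra attribute.
            return tag[:-1] + ' rel="nofollow">'
        return tag
    return re.sub(r'<a\s+href="([^"]*)"[^>]*>', mark, html)
```

A real implementation would do this while building the links rather than rewriting finished HTML, but the effect is the same: `?raw=on`, `?sortcol=...` and friends get nofollow, plain topic links don't.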
--
SamHasler - 01 Feb 2005
Is spending processing time finding links to add
rel="nofollow" to really the easiest way of doing this?
Couldn't we just add
Disallow: /*?* to robots.txt?
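Such a robots.txt would be a one-liner per user agent, e.g.:

```
# The "*" wildcard inside Disallow paths is a non-standard extension
# (Google honours it; other crawlers may not).
User-agent: *
Disallow: /*?*
```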
--
SamHasler - 01 Feb 2005
Seems nice and simple. I think it will work for Google, but the robots.txt validator says wildcards are non-standard (i.e. other spiders might ignore them).
http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
--
MattWilkie - 01 Feb 2005
Talking to Peter last night, he pointed out that not everyone has access to the robots.txt file for their site, so we will have to implement a
rel="nofollow" solution as well.
--
SamHasler - 02 Feb 2005
I recently tried the
rel="nofollow" solution in one of our local TWikis and the
robots.txt approach in another. The result was that the
rel attribute took less than a week to take effect with Google, while after 4 weeks I am still waiting for
robots.txt to work.
--
ChristopherOezbek - 07 Feb 2005
Maybe, due to the size or PageRank of those sites, they are not indexed/crawled as frequently.
It could also be that Google are specifically looking for sites that use
rel="nofollow" to reindex at this time.
--
SamHasler - 07 Feb 2005
Done in
DevelopBranch r3675. I added $cfg{NoFollow} that is set to the string rel='nofollow' by default. This is used in building links in code, and expanded as in templates and topics (though it should arguably be %CFG{"nofollow"}%, but that's a discussion for another day).
--
CrawfordCurrie - 20 Feb 2005
Crawford,
- does that mean that all URLs get a "nofollow" if it is set in the configuration? If so, I think this is overkill.
- I deliberately made the change in the BlackListPlugin since the nofollow feature is not in line with the TWikiMission.
- My proposed change is to add the nofollow only to links of topics that should never be indexed, such as rdiff, etc. (see "Impact and Available Solutions" above).
- There is another issue: I am interested in keeping twiki.org's Google ranking high. That is, we should not prevent Google from finding twiki.org via thousands of public TWiki sites.
- Unless there are good counter-arguments, I suggest reverting Crawford's last change.
- Also, all attributes TWiki generates so far use double quotes. Single quotes might be legal, but I am not sure all user agents support them. So, better to generate
rel="nofollow" as the BlackListPlugin does. (This point is N/A if the change is reverted.)
--
PeterThoeny - 22 Feb 2005
1 and 3: I followed your 23 Jan 2005 spec above. As I read it, that was the proposed change i.e.
we can add a rel="nofollow" attribute to the anchor tags of links such as printable, older revs, table sort, view raw etc. Normal topic links with no parameters do not carry nofollow. Neither do external links. If you are in doubt, inspect the results of the current code at
http://develop.twiki.org/~develop/cgi-bin/view/
2: In what way is nofollow not in line with the TWiki mission? Is it because it is mainly targeted at public TWikis? Many intranets use robots to index their own internal websites; why would
nofollow be uninteresting to them?
4: A link from a view page to TWiki.org is an external link and is not marked nofollow. Links to other view pages are not marked nofollow either. You can always set $cfg{NoFollow} to the empty string if you want all link types (such as rdiff and ?raw=on) to be followed.
5: Maybe it's just me, but I can't think of a good reason why this change
should be reverted...?
6:
RFC1866
(the
HTML 2.0 spec, the earliest I can find) says:
The value of the attribute may be either:
* A string literal, delimited by single quotes or double
quotes and not containing any occurrences of the delimiting
character.
--
CrawfordCurrie - 22 Feb 2005
While "nofollow" is nice, one would think that having a
<meta name="robots" ... statement in the skin header would be more effective. It turns out it is there, but it is edited out in
View.pm for all except older revisions.
This makes no sense to me. What is the point of making the site unconditionally indexable?
--
AntonAylward - 17 Jul 2005
Not so. In
CairoRelease it is never edited out. In
DevelopBranch it is edited out only if you have enabled {AntiSpam}{RobotsAreWelcome} in
configure.
--
CrawfordCurrie - 18 Jul 2005