Bug: National characters are encoded in Search Results
When I write national characters (Russian, in my case) in the topic text, the "view" script displays them unchanged. This is what I expect, and Russian-aware browsers display them correctly. But when the beginning of this topic is displayed in the search results (or on WebChanges, which uses some kind of "search" AFAIK), those characters are encoded in the HTML with a sequence like Ñ. My OperaBrowser (InternetExplorer behaves similarly) then displays them as various accented characters, which is not what I want, of course.
I didn't find the code that does this. (If I had, this report might never have been created.) But I noticed that this behaviour appeared after the last (December 2001) release.
For English TWiki sites, this is not a bug but at most an inconsistency in the code (the topic text is rendered differently in search results, with no clear motivation for the difference). But for users whose languages use non-standard sets of high-8-bit characters, this is a bug that should be fixed (or at least they should know how to fix it).
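To illustrate the symptom, here is a minimal Python sketch. It assumes the topic is stored in KOI8-R (the report does not say which Russian charset is in use), and `encode_high_bytes` is a hypothetical stand-in for whatever the search code does, not TWiki's actual routine. It shows how a byte value turned into a numeric character reference ends up as an accented Latin letter, because browsers read such references as Unicode code points regardless of the page charset.

```python
import html


def encode_high_bytes(raw: bytes) -> str:
    """Hypothetical stand-in: replace every byte >= 0x80 with a &#N; reference."""
    return "".join(f"&#{b};" if b >= 0x80 else chr(b) for b in raw)


# Assumption: the site stores topic text as KOI8-R bytes.
original = "я"
raw = original.encode("koi8-r")       # a single byte, 0xD1

# What the "view" script effectively sends: the raw byte, unchanged.
viewed = raw.decode("koi8-r")         # a KOI8-R aware browser shows the Russian letter

# What the search results send: the byte value as a numeric reference.
encoded = encode_high_bytes(raw)      # "&#209;"

# A browser resolves the reference as Unicode code point 209 (U+00D1).
displayed = html.unescape(encoded)    # an accented Latin letter, not the Russian one

print(viewed, encoded, displayed)
```

The raw byte depends on the declared charset for its meaning, while the reference does not, which is why the two code paths render differently.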
Test case
See NationalCharactersEncodedInSearchResultsTest and Search Results for NationalCharactersEncodedInSearchResultsTest (English-speaking readers will probably have to look at the HTML source to notice the difference).
Environment
--
PavelGoran - 04 Nov 2002
Follow up
This is an artefact of fixing NbspBreaksRssFeed, needed for TWikiSyndication.
Any idea how to fix the encoding without breaking the RssFeeds?
--
PeterThoeny - 06 Nov 2002
How about adding an option for encoding search results as required by TWikiSyndication? This would mean that the WebRssBase page needs updating, but normal searches would not be affected and would then get the unencoded characters.
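As a rough illustration of such an option (a Python sketch only; the function name `format_summary` and the flag `encode_entities` are hypothetical and do not come from the TWiki code), the encoding step could be made opt-in so that only the RSS page asks for it:

```python
def format_summary(summary: bytes, encode_entities: bool = False) -> bytes:
    """Return a search-result summary.

    Hypothetical sketch: bytes >= 0x80 are turned into &#N; references
    only when the caller (e.g. an RSS feed page) explicitly asks for it.
    """
    if not encode_entities:
        # Normal searches: pass the bytes through untouched.
        return summary
    out = bytearray()
    for b in summary:
        if b >= 0x80:
            out += f"&#{b};".encode("ascii")
        else:
            out.append(b)
    return bytes(out)


russian = "тест".encode("koi8-r")                          # assumed site charset
plain = format_summary(russian)                            # normal search
for_rss = format_summary(b"caf\xe9", encode_entities=True) # RSS-style request
print(plain == russian, for_rss)
```

With the flag off by default, existing searches keep the raw characters and only the feed page would need to change.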
--
RichardDonkin - 08 Nov 2002
Yes, I thought about this too. But I couldn't take a look at the WebRssBase page, so I don't know how difficult it would be to pass an additional switch to the search engine to enable encoding.
By the way, why was it necessary to encode high-8-bit characters in RssFeeds? The problem described in NbspBreaksRssFeed was about encoded entities, not about "bad raw characters". I also wonder why NbspBreaksRssFeed was not fixed in a more logical and "legitimate" way: by "registering" XHTML entities, as proposed by RichardDonkin in NbspBreaksRssFeed.
--
PavelGoran - 13 Nov 2002
Fixing WebRssBase would not be too hard IMO; the issue is changing how search encodes results when not used for RSS. However, I'm not clear why this encoding didn't work - are you in a character set other than ISO-8859-1? Also, have you changed the character set specified in the templates?
search.tmpl is used to format search results, so you may find that setting its character set to whatever you are using fixes this problem, because the browser will then render the characters properly. You should also set the charset in all other templates where 'iso-8859-1' appears (quite a lot of them...).
Changing browser character sets will be easier quite soon; I am working on this as part of InternationalisationEnhancements. There is a new TWikiAlphaRelease that should work much better with Russian characters (e.g. WikiWords and web names can include these characters) - let me know if you try it.
--
RichardDonkin - 01 Dec 2002
Fix record
This is now fixed in the latest TWikiAlphaRelease.
I tested this on http://donkin.org/bin/view/Test/RussianText under the KOI8-R character set (will probably be back on ISO-8859-1 soon though), using the new %CHARSET% variable in normal templates as well as the view.rss.tmpl used by RSS feeds, and it worked fine - I just took out the encoding of 8-bit characters, which only works with ISO-8859-1 as you pointed out. Mozilla 1.1, IE 5.5 and FeedReader were all OK with the Russian characters in the RSS feed - FeedReader displayed them as ISO-8859-1 but did not give any errors. Since Mozilla and IE have quite pedantic XML parsers for RSS feeds, I think that's good enough for the RSS reading people and would fix this for Russian users (and others outside the ISO-8859-1 area).
See InternationalisationEnhancements for full details of the fixes, and be sure to set your $siteLocale to something like ru_RU.KOI8-R in TWiki.cfg, using TWikiAlphaRelease of course.
--
RichardDonkin - 08 Dec 2002
The fix for this is updated in TWikiAlphaRelease so that RSS feeds have 8-bit characters encoded (even though they should be able to cope IMO), and normal search results don't have such encoding. You may still find that RSS feeds from non-ISO-8859-1 TWiki sites are not rendered properly by some RSS readers, but getting such scenarios to work may require UTF-8 - almost by definition, RSS readers bring in data from many different websites, and UTF-8 is probably the only way they can handle diverse national characters from sites using different charsets.
--
RichardDonkin - 10 Dec 2002
This bug fix has a bug - even for RSS feeds, characters with the high bit set should not be encoded as XML entities, since that means they are automatically interpreted as Unicode. Instead, they should just be passed through unchanged, meaning that the %CHARSET% in the WebRssBase page's XML declaration tag will be used by the browser or RSS reader (according to the XML specs anyway). See PageModeRssEncodeBug for discussion.
The result is that search will never encode high-bit characters as entities, whether in RSS mode or not.
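The difference between the two approaches can be checked with any conforming XML parser. The Python sketch below uses the standard library's ElementTree as a stand-in for an RSS reader; the two sample documents and the KOI8-R declaration are assumptions made for illustration, not actual TWiki feed output.

```python
import xml.etree.ElementTree as ET

# Raw byte 0xD1 passed through unchanged: its meaning comes from the
# encoding named in the XML declaration.
raw_doc = b'<?xml version="1.0" encoding="koi8-r"?><title>\xd1</title>'

# The same byte value written as a numeric character reference: this is
# always a Unicode code point, whatever the declaration says.
ref_doc = b'<?xml version="1.0" encoding="koi8-r"?><title>&#209;</title>'

raw_text = ET.fromstring(raw_doc).text   # decoded via the declared charset
ref_text = ET.fromstring(ref_doc).text   # resolved as U+00D1

print(raw_text, ref_text)
```

Only the pass-through form keeps the Russian letter, which matches the reasoning for dropping entity encoding in RSS mode as well.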
--
RichardDonkin - 16 Apr 2003