Bug: National characters are encoded in Search Results

When I write national characters (Russian, in my case) in the topic text, the "view" script displays them unchanged - this is what I expect, and Russian-aware browsers display them correctly. But when the topic (its beginning, in fact) is shown in the search results (or on WebChanges, which uses some kind of "search" AFAIK), those characters are encoded in the HTML as character entities, in sequences like Ñ. My OperaBrowser (InternetExplorer behaves similarly) then displays them as various accented characters, which is of course not what I want.
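To illustrate the symptom, here is a minimal Python sketch. It is not TWiki's actual code (TWiki is written in Perl), and it rests on assumptions: the topic is stored as KOI8-R bytes, and the search code turns every high-bit byte into a numeric character reference. The helper name `encode_high_bytes` is hypothetical.

```python
import html

def encode_high_bytes(raw: bytes) -> str:
    """Replace every byte >= 0x80 with a numeric character reference (assumed behaviour)."""
    return "".join(f"&#{b};" if b >= 0x80 else chr(b) for b in raw)

russian = "Привет"
raw = russian.encode("koi8-r")       # bytes as stored on an assumed KOI8-R site
encoded = encode_high_bytes(raw)     # what the search results would contain
shown = html.unescape(encoded)       # how a browser resolves the references

# The references are resolved as Unicode code points, not as KOI8-R bytes,
# so the browser shows accented Latin characters instead of Cyrillic.
print(encoded)
print(shown)
```

A numeric reference always names a Unicode code point, so for bytes in this range the result is the same as reading the KOI8-R bytes as ISO-8859-1, whatever charset the page declares.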

I couldn't find the code that does this. (If I had, this report might never have been created.) But I noticed that this behaviour appeared after the last (December 2001) release.

For English TWiki sites this is not a bug, at most an inconsistency in the code (the topic text is rendered differently in search results, with no clear motivation for the difference). But for users whose languages use non-standard sets of high-8-bit characters, it is a bug that should be fixed (or at least they should know how to fix it).

Test case

See NationalCharactersEncodedInSearchResultsTest and Search Results for NationalCharactersEncodedInSearchResultsTest (English people will probably have to look at the HTML source to notice the difference).

Environment

TWiki version: twiki.org
Web Browser: OperaBrowser

-- PavelGoran - 04 Nov 2002

Follow up

This is an artefact of fixing NbspBreaksRssFeed, needed for TWikiSyndication.

Any idea how to fix the encoding without breaking the RssFeeds?

-- PeterThoeny - 06 Nov 2002

How about adding an option for encoding search results as required by TWikiSyndication? This would mean that the WebRssBase page needs updating, but normal searches would not be affected and would then get the unencoded characters.

-- RichardDonkin - 08 Nov 2002

Yes, I thought about this too. But I couldn't take a look at the WebRssBase page, so I don't know how difficult it would be to pass the search engine an additional switch that enables the encoding.

By the way - why was it necessary to encode high-8-bit characters in RssFeeds? The problem described in NbspBreaksRssFeed was about encoded entities, not about "bad raw characters". I also wonder why NbspBreaksRssFeed was not fixed in a more logical and "legitimate" way - by "registering" XHTML entities, as proposed by RichardDonkin in NbspBreaksRssFeed.

-- PavelGoran - 13 Nov 2002

Fixing WebRssBase would not be too hard IMO; the issue is changing how search encodes results when not used for RSS. However, I'm not clear why this encoding didn't work - are you using a character set other than ISO-8859-1? Also, have you changed the character set specified in the templates? search.tmpl is used to format search results, so you may find that setting its character set to whatever you are using fixes this problem, because the browser will then render the characters properly. You should also set the charset in all other templates where 'iso-8859-1' appears (quite a lot of them...).
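As a rough sketch of the template change suggested above, the Python snippet below swaps a hard-coded 'iso-8859-1' for another charset in a piece of template text. The sample `<meta>` line and the function name `set_template_charset` are hypothetical; they are not copied from a real TWiki template, so check what your own templates contain before applying anything like this.

```python
import re

def set_template_charset(template_text: str, new_charset: str) -> str:
    """Replace every hard-coded 'iso-8859-1' (any letter case) with new_charset."""
    # A function replacement avoids re.sub interpreting backslashes in new_charset.
    return re.sub(r"iso-8859-1", lambda _m: new_charset, template_text,
                  flags=re.IGNORECASE)

# Hypothetical template line, for illustration only.
sample = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />'
updated = set_template_charset(sample, "koi8-r")
print(updated)
```

The same function could be run over each template file in turn, writing the result back only after checking the diff.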

Changing browser character sets will be easier quite soon; I am working on this as part of InternationalisationEnhancements. There is a new TWikiAlphaRelease that should work much better with Russian characters (e.g. WikiWords and web names can include these characters) - let me know if you try it.

-- RichardDonkin - 01 Dec 2002

Fix record

This is now fixed in the latest TWikiAlphaRelease.

I tested this on http://donkin.org/bin/view/Test/RussianText under the KOI8-R character set (will probably be back on ISO-8859-1 soon though), using the new %CHARSET% variable in normal templates as well as the view.rss.tmpl used by RSS feeds, and it worked fine - I just took out the encoding of 8-bit characters, which only works with ISO-8859-1 as you pointed out. Mozilla 1.1, IE 5.5 and FeedReader were all OK with the Russian characters in the RSS feed - FeedReader displayed them as ISO-8859-1 but did not give any errors. Since Mozilla and IE have quite pedantic XML parsers for RSS feeds, I think that's good enough for the RSS reading people and would fix this for Russian users (and others outside the ISO-8859-1 area).

See InternationalisationEnhancements for full details of the fixes and be sure to set your $siteLocale to something like ru_RU.KOI8-R in TWiki.cfg, using TWikiAlphaRelease of course.

-- RichardDonkin - 08 Dec 2002

The fix for this is updated in TWikiAlphaRelease so that RSS feeds have 8-bit characters encoded (even though they should be able to cope IMO), and normal search results don't have such encoding. You may still find that RSS feeds from non-ISO-8859-1 TWiki sites are not rendered properly by some RSS readers, but getting such scenarios to work may require UTF8 - almost by definition, RSS readers bring in data from many different websites, and UTF8 is probably the only way they can handle diverse national characters from sites using different charsets.
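The UTF8 point can be sketched in a few lines of Python: text from sites using different 8-bit charsets can only share one document after each piece is transcoded to a common encoding. The two sample strings, their charsets and the helper name `to_utf8` are assumptions made for illustration.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes using the site's charset, then re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")

# Hypothetical feed items from two sites with different charsets.
items = [
    ("Привет".encode("koi8-r"), "koi8-r"),
    ("café".encode("iso-8859-1"), "iso-8859-1"),
]

# Once transcoded, both items can be stored and displayed together.
merged = b"\n".join(to_utf8(raw, charset) for raw, charset in items)
print(merged.decode("utf-8"))
```

This only works if each site's charset is known, which is why the charset declared by the feed matters.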

-- RichardDonkin - 10 Dec 2002

This bug fix has a bug - even for RSS feeds, characters with the high bit set should not be encoded as XML entities, since that means they are automatically interpreted as Unicode. Instead, they should just be passed through unchanged, meaning that the %CHARSET% in the WebRssBase page's XML declaration tag will be used by the browser or RSS reader (according to the XML specs, anyway). See PageModeRssEncodeBug for discussion.

The result is that search will never encode high-bit characters as entities, whether in RSS mode or not.
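The difference between the two approaches can be checked with a small Python sketch, in which Python's standard `xml.etree.ElementTree` parser stands in for an RSS reader. The KOI8-R charset, the one-element document and the function names are assumptions for illustration; a real feed built from WebRssBase has more structure.

```python
import xml.etree.ElementTree as ET

TEXT = "Привет"
CHARSET = "koi8-r"  # assumed site charset, as %CHARSET% would supply it

def feed_with_raw_bytes(text: str, charset: str) -> bytes:
    """High-bit characters passed through unchanged, charset declared in the XML declaration."""
    doc = f'<?xml version="1.0" encoding="{charset}"?><title>{text}</title>'
    return doc.encode(charset)

def feed_with_entities(text: str, charset: str) -> bytes:
    """High-bit bytes turned into numeric character references (the buggy variant)."""
    body = "".join(f"&#{b};" if b >= 0x80 else chr(b) for b in text.encode(charset))
    doc = f'<?xml version="1.0" encoding="{charset}"?><title>{body}</title>'
    return doc.encode(charset)

raw_title = ET.fromstring(feed_with_raw_bytes(TEXT, CHARSET)).text
entity_title = ET.fromstring(feed_with_entities(TEXT, CHARSET)).text

print(raw_title)     # decoded using the declared charset
print(entity_title)  # references resolved as Unicode code points
```

With raw bytes the parser honours the declared charset and recovers the original text; with references the declared charset is irrelevant, because each reference names a Unicode code point directly.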

-- RichardDonkin - 16 Apr 2003

Topic revision: r9 - 2020-04-26 - PeterThoeny
 