
Archive of Discussion on EncodeURLsWithUTF8

From InternationalisationUTF8 - older discussion.

I think it's impractical to require the browser's UTF-8 url encoding to be disabled, especially for public sites. Wouldn't it be a simple matter to support a UTF-8 encoded url from a GET or a POST, at least partially in order to remove this user requirement for a large number of situations? A UTF-8 encoded character breaks down as follows:

   bytes   bits   representation
     1       7    0vvvvvvv
     2      11    110vvvvv 10vvvvvv
     3      16    1110vvvv 10vvvvvv 10vvvvvv
     4      21    11110vvv 10vvvvvv 10vvvvvv 10vvvvvv

Converting looks sort of like this in Perl:

from ISO 8859-1 to UTF-8:  s/([\x80-\xFF])/chr(0xC0|ord($1)>>6).chr(0x80|ord($1)&0x3F)/eg;

from UTF-8 to ISO 8859-1:  s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1)<<6&0xC0|ord($2)&0x3F)/eg;  
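
For reference, here is a minimal, self-contained sketch wrapping those two substitutions (illustrative only, not TWiki code; it treats strings as raw bytes and only covers the ISO-8859-1 range):

   #!/usr/bin/perl
   # Minimal sketch of the ISO-8859-1 <-> UTF-8 byte conversions above.
   # Works on raw byte strings; it does not use Perl's Unicode layer.
   use strict;
   use warnings;

   sub latin1_to_utf8 {
       my ($s) = @_;
       # Each byte 0x80-0xFF becomes a two-byte UTF-8 sequence (lead byte 0xC2 or 0xC3).
       $s =~ s/([\x80-\xFF])/chr(0xC0 | ord($1) >> 6) . chr(0x80 | ord($1) & 0x3F)/eg;
       return $s;
   }

   sub utf8_to_latin1 {
       my ($s) = @_;
       # Only two-byte sequences starting 0xC2/0xC3 map back into ISO-8859-1.
       $s =~ s/([\xC2\xC3])([\x80-\xBF])/chr(ord($1) << 6 & 0xC0 | ord($2) & 0x3F)/eg;
       return $s;
   }

   my $latin1 = "Fran\xE7ois";               # 'c-cedilla' is 0xE7 in ISO-8859-1
   my $utf8   = latin1_to_utf8($latin1);     # becomes "Fran\xC3\xA7ois"
   printf "UTF-8 bytes: %vX\n", $utf8;
   printf "Round trip:  %s\n", utf8_to_latin1($utf8) eq $latin1 ? "ok" : "mismatch";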

Without going the full Unicode route yet, it seems like it would be a relatively straightforward hack to make TWiki do the right thing in a few more instances where the user forgot to disable the UTF-8 URL encoding in the browser.

-- TomKagan - 18 Dec 2002

Interesting idea about the UTF8 decoding / encoding - however, there would need to be some way of recognising that a URL seen by TWiki in a POST or GET has been UTF8-encoded. From what I've read of Unicode, there is no reliable way of doing this, though heuristics might work most of the time. Some browsers (e.g. Mozilla) don't UTF8-encode URLs by default, so both modes would need to be supported. At present, non-UTF8-encoding is the only mode supported by all browsers I have tried, i.e. the major ones listed above. I'd need to think about this, but it's unlikely this will make it into BeijingRelease.

Full UTF8 support in TWiki is probably the best route for public sites using I18N - at present, the I18N support is intended for intranet sites, or public sites aimed at a single locale where everyone tends to use a single character set (e.g. KOI8-R for Russia perhaps).

-- RichardDonkin - 19 Dec 2002

I understand the difficulty with what you are coding and how you intend it to be used. However, even on a public site running under ISO-8859-1, it would be helpful to have a few nicely named TWiki topics work when the browser sends UTF-8 encoded URLs. Regardless, even on a public site with a different locale, it would be nice for TWiki to 'do the right thing' in this one additional circumstance.

UTF-8 encoded urls can be identified with a high degree of certainty by looking for '%HH%HH' in the string, where 'HH' is the hex value of each byte of the character. It is highly unlikely to find the '%HH%HH' sequence in a non-encoded url - and even if you did, most such urls would probably be invalid TWiki access urls anyway.

If any character of the url has the 8th bit set, it's not an encoded url. And, if no character has the 8th bit set and a '%HH%HH' isn't found, the encoding option doesn't matter because the resulting url would be the same regardless of the browser's option setting.

As an example, 'François' would be UTF-8 encoded as 'Fran%c3%a7ois' in the url sent from the browser.

It should be relatively simple to better handle both cases for the browser's option transparently.
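
For illustration, a rough sketch of that check (the function name and exact byte ranges are mine, not TWiki code; it only says a url "looks like" UTF-8, it cannot prove it):

   #!/usr/bin/perl
   # Guess whether a percent-encoded URL was UTF-8 encoded by the browser.
   use strict;
   use warnings;

   sub looks_utf8_encoded {
       my ($url) = @_;
       # A raw byte with the 8th bit set means the URL was not percent-encoded at all.
       return 0 if $url =~ /[\x80-\xFF]/;
       # Adjacent %HH%HH pairs where the first byte is a UTF-8 lead byte (0xC2-0xF4)
       # and the second a continuation byte (0x80-0xBF) strongly suggest UTF-8.
       return 1 if $url =~ /%(?:C[2-9A-F]|D[0-9A-F]|E[0-9A-F]|F[0-4])%[89AB][0-9A-F]/i;
       return 0;
   }

   print looks_utf8_encoded('Fran%c3%a7ois') ? "UTF-8\n" : "not UTF-8\n";   # UTF-8
   print looks_utf8_encoded('Fran%E7ois')    ? "UTF-8\n" : "not UTF-8\n";   # not UTF-8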

-- TomKagan - 19 Dec 2002

Unfortunately, your suggested heuristic for recognising UTF8-encoded URLs will break in some locales - for example, all of the letters in a Russian WikiWord are likely to be URL-encoded in both KOI8-R and UTF8, making it impossible to use this heuristic to tell which encoding was used (e.g. this KOI8-R test topic, linked from CyrillicSupport). JapaneseAndChineseSupport has the same problem. Generally, auto-detecting character sets is a difficult problem that would require a reasonable sample of text, ideally chosen by the application - however, I have not researched this very much, so pointers to any simple algorithms are welcome.

It's also important to decode UTF8 sequences carefully, for security reasons, rejecting overlong UTF8 sequences (as discussed in this UTF8 on Unix FAQ) - otherwise, someone could encode ../../etc/passwd as overlong UTF8 sequences, preventing TWiki from filtering out the dangerous '..' and '/' sequences. TWiki may need to move to 'filtering in' as discussed in TaintChecking, which would ensure that only 'letters' (alphabetic or ideogrammatic characters) are included in topic and web names.
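
To make the overlong issue concrete, here is an illustrative validator that accepts only well-formed UTF-8 (the byte ranges follow RFC 3629; this is a sketch, not TWiki's actual filtering code):

   #!/usr/bin/perl
   # Reject malformed UTF-8, including overlong encodings such as 0xC0 0xAF for '/'.
   use strict;
   use warnings;

   sub is_strict_utf8 {
       my ($bytes) = @_;
       return $bytes =~ /\A(?:
           [\x00-\x7F]                          # ASCII
         | [\xC2-\xDF][\x80-\xBF]               # 2-byte, lead >= 0xC2 excludes overlongs
         | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excludes overlong E0 80..9F
         | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte
         | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excludes UTF-16 surrogates
         | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, excludes overlongs
         | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte
         | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
       )*\z/x;
   }

   # '../' spelled with overlong two-byte sequences (0xC0 0xAE = '.', 0xC0 0xAF = '/')
   print is_strict_utf8("\xC0\xAE\xC0\xAE\xC0\xAF") ? "accepted\n" : "rejected\n";  # rejected
   print is_strict_utf8("Fran\xC3\xA7ois")          ? "accepted\n" : "rejected\n";  # accepted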

-- RichardDonkin - 23 Dec 2002

I would like to add to what TomKagan said: If you want to be able to enter and display characters from any language in the world, Unicode encoding is the way to go. As far as I see, the most standard way would be to serve all pages UTF-8 encoded and translate all trans-ASCII characters in page names to be UTF-8 and then URL-encoded. The conversion to UTF-8 will be done by the client, because of the encoding setting of the HTML page.

For examples see http://wiktionary.org/wiki/Two, a page which contains Greek and Russian text on the same page. One could just as well add Chinese and whatever else... See the links to the Icelandic translations ("tvær" and "tvö") for the effect. "ö" in UTF-8 is the two-byte sequence 0xC3 0xB6; the URL-encoded hex version of those bytes is "%C3%B6".
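
A small sketch of those two steps (illustrative only; the escaping regex is simplified compared to real URL escaping, and assumes Perl 5.8's Encode module):

   #!/usr/bin/perl
   # A character is first encoded as UTF-8 bytes, then each byte is percent-encoded.
   use strict;
   use warnings;
   use utf8;                                      # this source file contains a literal o-umlaut
   use Encode qw(encode);

   my $name  = "tvö";                             # Icelandic for 'two' (neuter)
   my $bytes = encode('UTF-8', $name);            # "tv\xC3\xB6"
   (my $url  = $bytes) =~ s/([^A-Za-z0-9])/sprintf('%%%02X', ord($1))/eg;
   print "$url\n";                                # prints "tv%C3%B6"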

-- MichaelMuellerHillebrand - 31 Jan 2003

I agree completely that UTF-8 is the way to go - however, getting this to work is more complex, because of some problems with Perl UTF-8 support, which may require Perl 5.8 (not used by most people yet) and the fact that some browsers don't yet support UTF-8, or only work with configuration - see Wikipedia's technical issues list. Broken or mis-configured browsers can actually corrupt UTF-8 Wiki pages, and the TWikiMission aims to support virtually any browser, so it's sensible to get the 8-bit charsets working first, and then move to UTF-8. There is the issue of what to do with non-UTF8 browsers, which will also have URL problems.

Of course, the work done so far on I18N is highly applicable to UTF-8, and will greatly simplify adopting it since all regexes are now centralised (and should already work with Unicode anyway).

Thanks for the link to Wiktionary.org, which is part of the Wikipedia initiative and runs on the PHP-based Wikipedia software. Their I18N does look quite advanced.

-- RichardDonkin - 01 Feb 2003

While researching this a little longer, I found some interesting side effects. I assumed that no special UTF-8 support from Perl would be necessary if all data already arrives properly UTF-8 encoded. The only thing needed, it seemed, was to translate any byte > \x7F into its proper URL-encoded %XX notation.

But: this blows up file name lengths! E.g. Russian and Greek text needs two-byte UTF-8 sequences throughout, so a word of, say, 7 characters becomes 14 UTF-8 bytes, which become 42 characters in the file name! Chinese, Japanese and Korean symbols generally need three UTF-8 bytes, so file names grow to nine times the source language character count. This in turn limits the possible length to a ninth or a sixth of that of plain ASCII text.
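
A quick illustration of that arithmetic (illustrative Perl only; it assumes every byte of a non-ASCII word gets percent-encoded, which is the case for Cyrillic, Greek and CJK text):

   #!/usr/bin/perl
   # Each non-ASCII character is 2-3 UTF-8 bytes; each byte becomes three
   # characters ('%XX') once URL-encoded into the file name.
   use strict;
   use warnings;
   use utf8;                                      # the literals below are UTF-8
   use Encode qw(encode);
   binmode STDOUT, ':encoding(UTF-8)';

   for my $word ('Пример', '例') {                # a Russian word and a CJK ideograph
       my $bytes = length(encode('UTF-8', $word));
       printf "%s: %d char(s) -> %d UTF-8 byte(s) -> %d encoded char(s)\n",
              $word, length($word), $bytes, 3 * $bytes;
   }
   # Пример: 6 char(s) -> 12 UTF-8 byte(s) -> 36 encoded char(s)
   # 例: 1 char(s) -> 3 UTF-8 byte(s) -> 9 encoded char(s)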

-- MichaelMuellerHillebrand - 03 Feb 2003

Perl UTF-8 support is important in some areas, e.g. matching upper and lower case as part of WikiWords. The grep would also need to be able to do regex matching and case-insensitive matching using UTF-8, not just plain bytes - I think GNU grep can do this, but working locales would be needed.
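
For illustration, a minimal sketch of WikiWord-style matching using Perl's Unicode letter properties (assumes Perl 5.8+ and already-decoded strings; the pattern is a toy, not TWiki's actual centralised regexes):

   #!/usr/bin/perl
   use strict;
   use warnings;
   use utf8;                                  # literals below are UTF-8
   binmode STDOUT, ':encoding(UTF-8)';

   # A naive WikiWord: two or more runs of an uppercase letter followed by
   # lowercase letters, in any script.
   my $wikiword = qr/(?:\p{Lu}\p{Ll}+){2,}/;

   for my $name ('WebHome', 'ВикиСлово', 'webhome') {
       printf "%s %s\n", $name,
              $name =~ /^$wikiword$/ ? 'is a WikiWord' : 'is not a WikiWord';
   }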

The length of filenames and WikiWords is not a big issue IMO, since it is shared with all UTF-8 applications, although of course UTF-8 topics and filenames will use more storage than an English Wiki. Most OSs now support quite long filenames, e.g. 255 bytes, which is much longer than 2 to 9 times a typical WikiWord's length, so this shouldn't be a problem in practice. This is particularly the case with ideogrammatic writing systems such as Chinese or Japanese [kanji], where a character taking 9 bytes in the encoded name really represents a whole word, i.e. in many cases not much longer than the same word in English.

(Copied from InternationalisationEnhancements since it's relevant here:)
Note that some ISO-8859-1 characters earlier in InternationalisationEnhancements, and now in InternationalisationUTF8, were corrupted inadvertently a few days ago - this was in revision 1.48 of 30 Jan 2003, done by myself :-) ... (TWiki revision tracking is amazingly useful!) I was probably using Phoenix, a Mozilla-based browser, but I'm not 100% sure. Any Phoenix users should test carefully - most of my testing was done with IE5, Mozilla, Opera 6 and (some) K-Meleon, so I'm surprised to see this.

I suspect a UTF-related browser bug, since two characters ('?e') were turned into a single '?' character, which probably indicates an invalid UTF-8 encoding; if it were a non-UTF-8 bug, the number of characters would have been preserved. Any browser that correctly uses the Content-Type header, controlled by the new TWiki %CHARSET% variable (currently iso-8859-1 on TWiki.org), should not corrupt TWiki pages, even if it supports UTF-8.

-- RichardDonkin - 04 Feb 2003

-- RichardDonkin - 27 Oct 2004
