Translation Token Clashes With 2-Byte Characters

I'm using Korean as the default language and found that $TranslationToken clashes with some Korean (2-byte) letters such as _B3_AA (sounds like 'nah').

Words that include such letters became invisible in web pages, although they were stored correctly in the file. Of course I couldn't use TWiki with this clash, and I began to think it would be better if $TranslationToken were a longer, hard-to-clash word. So I set $TranslationToken to '__this_is_translation_token__' and the problem was solved.

I think a $TranslationToken of 1-byte length is somewhat dangerous in a multi-byte charset environment. What if the default $TranslationToken were a longer, clash-safe string? Maybe this should be discussed in the context of the localization or I18N topics.

-- JikhanJung - 26 Oct 2001
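To make the clash above concrete, here is a minimal Python sketch (not TWiki's actual Perl code). It assumes the one-byte token is 0xB3, i.e. the \263 default mentioned later in this topic, and uses the byte pair B3 AA quoted in the post. The helper name token_offsets is hypothetical.

```python
def token_offsets(page: bytes, token: bytes) -> list:
    """Return every byte offset at which `token` occurs in `page`."""
    offsets = []
    start = page.find(token)
    while start != -1:
        offsets.append(start)
        start = page.find(token, start + 1)
    return offsets


# The two-byte sequence quoted in the post (B3 AA), surrounded by ASCII.
page = b"abc \xb3\xaa def"

short_token = b"\xb3"  # assumed one-byte token (octal \263)
long_token = b"__this_is_translation_token__"

print(token_offsets(page, short_token))  # [4]: the token matches half a letter
print(token_offsets(page, long_token))   # []: no clash with the long token
```

A byte-wise search cannot tell that the 0xB3 it found is the first half of a 2-byte letter, which is why a longer token is much less likely to clash.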

I'm somewhat surprised nobody cared to answer this, let alone change this question's status to a feature request :-)

The default translation token value (\263) clashes with one of the common characters in the Polish language, changing it to '&' during page rendering (though such characters are stored correctly and are visible in the page editor). Is the value of TranslationToken important? Will I break something by changing it to something that doesn't clash with anything I use (say, \264)?

-- MarcinKaszynski - 04 Oct 2002
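One way the symptom above could arise is sketched below in Python. This is only a guess at the mechanism, not TWiki's real rendering code: it assumes the renderer hides literal '&' characters behind the token and restores them at the end, and it assumes the page is in ISO-8859-2, where byte 0xB3 (octal \263) is 'ł'. The function name render is hypothetical.

```python
def render(page: bytes, token: bytes) -> bytes:
    # Assumed pass 1: hide every literal '&' behind the token.
    work = page.replace(b"&", token)
    # (Other rendering passes would run here.)
    # Assumed pass 2: turn every token back into '&'.
    return work.replace(token, b"&")


# "złoty & co" as ISO-8859-2 bytes; 0xB3 is the letter 'ł'.
page = b"z\xb3oty & co"

print(render(page, b"\xb3"))  # b'z&oty & co'     (token \263 eats the letter)
print(render(page, b"\xb4"))  # b'z\xb3oty & co'  (token \264 leaves it alone)
```

Under these assumptions, switching to \264 is safe only as long as no page contains byte 0xB4.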

Maybe it got missed... Anyway, you should be fine with another character, or with a string of several unusual characters as suggested above. '__this_is_translation_token__' should be OK, but of course pages that contain that value will be munged. It would be best to first check the page for that value and convert it to a random string of digits and letters (N), then do the translation, then convert N back again. In practice, a page containing the token would be quite rare.

-- RichardDonkin - 05 Oct 2002
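The check-and-convert step suggested above could look roughly like this in Python. It is a sketch under assumptions, not TWiki code: render is a hypothetical stand-in for the translation pass (here it hides '<' behind the token and restores it as '&lt;'), and pick_placeholder and render_escaped are illustrative names.

```python
import secrets

TOKEN = b"__this_is_translation_token__"


def render(page: bytes, token: bytes) -> bytes:
    # Hypothetical translation pass: hide '<' behind the token, then
    # replace every token with the HTML entity.
    work = page.replace(b"<", token)
    return work.replace(token, b"&lt;")


def pick_placeholder(page: bytes, token: bytes) -> bytes:
    # Random digits and letters (the "N" in the post) that do not occur
    # in the page and do not contain the token themselves.
    while True:
        candidate = secrets.token_hex(8).encode("ascii")
        if candidate not in page and token not in candidate:
            return candidate


def render_escaped(page: bytes, token: bytes) -> bytes:
    placeholder = pick_placeholder(page, token)
    safe = page.replace(token, placeholder)      # token text -> N
    rendered = render(safe, token)               # do the translation
    return rendered.replace(placeholder, token)  # N -> token text


page = b"a < b, and the token is " + TOKEN
print(render(page, TOKEN))          # the literal token text is munged
print(render_escaped(page, TOKEN))  # the literal token text survives
```

The placeholder only needs to be unique within the one page being rendered, so a short random string is enough.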

Now fixed in TWikiAlphaRelease. This is really a bug, since it is possible to pick a translation token that works for any 8-bit character set: \0, which is hopefully the NUL character in all character sets, and certainly in all the 8-bit ones I know of. NUL is a valid Perl character but quite hard to type into a browser or even cut and paste (I tried :-) ).

Can someone who knows about multi-byte characters tell me whether NUL is a valid character as part of any multi-byte character strings? Someone local who knows about this says that it is not used, because the C language is unable to handle NUL characters and lots of programs would break.

If NUL is no good for some reason, there is a place in the TWiki.pm code where you can set it to be a multi-character string, which can be a nonsense word with some unusual characters such as control characters. This should enable any multi-byte character encoding to work, even if some character strings include NUL.

-- RichardDonkin - 19 Nov 2002
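A rough Python sketch of the NUL-token idea follows. It is not the TWiki.pm implementation: the render function and its '<' handling are hypothetical, and the step that drops NULs from the incoming text follows the behaviour described later in this topic.

```python
NUL = b"\x00"


def render(page: bytes) -> bytes:
    # Drop any stray NULs first, so the token cannot already be in the text.
    clean = page.replace(NUL, b"")
    # Hypothetical translation pass using NUL as the temporary marker.
    work = clean.replace(b"<", NUL)
    return work.replace(NUL, b"&lt;")


# High bytes such as B3 AA pass through untouched; the stray NUL is removed.
page = b"a\x00 < \xb3\xaa"
print(render(page))  # b'a &lt; \xb3\xaa'
```

This works for any encoding whose byte strings never contain 0x00 as part of a character.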

I've done a bit more research, and at least three of the most common Japanese charsets don't use null bytes within their codings. Also, UTF-8 doesn't use NUL except when encoding ASCII NUL. Embedded NULs were probably used in legacy double-byte character sets but don't seem common these days.

So the current (Feb2003 onwards) code should be fine for virtually all charsets - any NULs in the original text are removed, but it's most unlikely they would get in there in the first place.

-- RichardDonkin - 05 Sep 2003
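The claim about NUL bytes can be spot-checked with Python's standard codecs. The post does not name the three Japanese charsets, so EUC-JP, Shift_JIS and ISO-2022-JP are assumed here; the sample string and the helper contains_nul are illustrative only. One sample is a sanity check, not a proof for the whole character set.

```python
def contains_nul(text: str, codec: str) -> bool:
    """Encode `text` with `codec` and report whether any byte is 0x00."""
    return b"\x00" in text.encode(codec)


SAMPLE = "日本語のテキスト ABC"

for codec in ("euc_jp", "shift_jis", "iso2022_jp", "utf_8"):
    print(codec, contains_nul(SAMPLE, codec))  # False for each codec

# UTF-8 only produces a NUL byte when encoding the NUL character itself.
print(contains_nul("\x00", "utf_8"))  # True

# For contrast, a 16-bit encoding does embed NUL bytes in ordinary text.
print(contains_nul("A", "utf_16_le"))  # True
```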

Topic revision: r7 - 2003-09-05 - RichardDonkin
 