Translation Token Clashes With 2-Byte Characters
I'm using Korean as the default language and found that $TranslationToken clashes with some Korean (2-byte) letters such as _B3_AA (sounds 'nah').
Words that include such letters became invisible in web pages, although they were stored correctly in the file. Of course I couldn't use TWiki with this clash, so I began thinking it would be better if $TranslationToken were a longer, hard-to-clash word. I set $TranslationToken to '__this_is_translation_token__' and the problem was solved.
A 1-byte $TranslationToken is somewhat dangerous in a multi-byte charset environment, I think. What if the default $TranslationToken were a longer, clash-safe string? Maybe this should be discussed in the context of localization or I18N topics.
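A minimal sketch of why a one-byte token is risky, assuming the pages are stored as EUC-KR and the token is the single byte 0xB3 (octal \263). `token_clashes` is a hypothetical helper written for illustration, not TWiki code:

```python
def token_clashes(token: bytes, text: str, encoding: str) -> bool:
    """Return True if the token's bytes occur anywhere in the encoded text."""
    return token in text.encode(encoding)


ONE_BYTE_TOKEN = b"\xb3"                       # octal \263
LONG_TOKEN = b"__this_is_translation_token__"  # the replacement used above

word = "나라"  # the first syllable is the byte pair 0xB3 0xAA in EUC-KR

print(token_clashes(ONE_BYTE_TOKEN, word, "euc_kr"))  # True
print(token_clashes(LONG_TOKEN, word, "euc_kr"))      # False
```

Any single byte in the 0x80-0xFF range can turn up as half of some 2-byte character, which is why the longer string is much less likely to collide.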
--
JikhanJung - 26 Oct 2001
I'm somewhat surprised nobody cared to answer this, let alone change this question's status to feature request.
The default translation token value (\263) clashes with one of the common characters in the Polish language, changing it to '&' during page rendering (though such characters are correctly stored and visible in the page editor). Is the value of TranslationToken important? Will I break something by changing it to something that doesn't clash with anything I use (say, \264)?
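The symptom can be reproduced with a toy renderer. This is only a sketch: it assumes the token is used to hide '&' temporarily and is swapped back at the end, which is a guess consistent with what I see rather than the actual TWiki code. `render` is a hypothetical function.

```python
def render(raw: bytes, token: bytes) -> bytes:
    """Hypothetical renderer: hide '&' behind the token, then restore it."""
    text = raw.replace(b"&", token)   # protect '&' while other rules would run
    # ... other rendering rules would go here ...
    return text.replace(token, b"&")  # restore; also rewrites any real token byte


# In ISO-8859-2 the Polish letter "ł" is the single byte 0xB3 (octal \263).
page = "słowo & zdanie".encode("iso-8859-2")

print(render(page, b"\xb3").decode("iso-8859-2"))  # s&owo & zdanie
print(render(page, b"\xb4").decode("iso-8859-2"))  # słowo & zdanie
```

In this toy, \264 works only because the byte 0xB4 does not occur in the sample page; whether it is safe for a real site depends on the characters actually used there.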
--
MarcinKaszynski - 04 Oct 2002
Maybe it got missed... Anyway, you should be fine with another character, or a string of several unusual characters as suggested above. '__this_is_translation_token__' should be OK, but of course pages that include that value will be munged. It would be best to first check the page for that value and convert it to a random string of digits and letters (N), then do the translation, then translate N back again. In practice, pages containing the token would be quite rare.
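The check-and-convert idea could look roughly like this. It is a sketch under assumptions: `translate_safely` and `toy_translate` are hypothetical names, and the toy translation step (hide '&' behind the token, then restore it) merely stands in for whatever the real rendering does.

```python
import secrets


def translate_safely(text: str, token: str, translate) -> str:
    """Protect literal occurrences of the token while `translate` runs."""
    placeholder = None
    if token in text:
        # Pick a random string of digits and letters (N) that is not already
        # in the text and cannot itself contain the token.
        while True:
            placeholder = secrets.token_hex(8)
            if placeholder not in text and token not in placeholder:
                break
        text = text.replace(token, placeholder)
    text = translate(text, token)
    if placeholder is not None:
        text = text.replace(placeholder, token)  # translate N back again
    return text


def toy_translate(text: str, token: str) -> str:
    # Hypothetical stand-in for rendering.
    return text.replace("&", token).replace(token, "&")


TOKEN = "__this_is_translation_token__"
page = "Set it to __this_is_translation_token__ & reload"

print(toy_translate(page, TOKEN))                            # Set it to & & reload
print(translate_safely(page, TOKEN, toy_translate) == page)  # True
```

Without the guard, the literal token in the page comes back as '&'; with it, the page survives the round trip unchanged.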
--
RichardDonkin - 05 Oct 2002
Now fixed in TWikiAlphaRelease - this is really a bug, since it is possible to pick a translation token that works for any 8-bit character set (i.e. \0, the NUL character in all character sets hopefully, certainly all 8 bit ones I know of). NUL is a valid Perl character but quite hard to type into a browser or even cut/paste - I tried.
Can someone who knows about multi-byte characters tell me whether NUL is a valid character as part of any multi-byte character strings? Someone local who knows about this says that it is not used, because the C language is unable to handle NUL characters and lots of programs would break.
If NUL is no good for some reason, there is a place in the TWiki.pm code where you can set it to be a multi-character string, which can be a nonsense word with some unusual characters such as control characters. This should enable any multi-byte character encoding to work, even if some character strings include NUL.
--
RichardDonkin - 19 Nov 2002
I've done a bit more research, and at least three of the most common Japanese charsets don't use null bytes within their codings. Also, UTF-8 doesn't use NUL except when encoding ASCII NUL. Embedded NULs were probably used in legacy double-byte character sets but don't seem common these days.
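A quick spot check along these lines, not a proof. The three Japanese charsets named here (EUC-JP, Shift_JIS, ISO-2022-JP) are my assumption about which ones count as most common, and `contains_nul` is a hypothetical helper:

```python
SAMPLE = "日本語のテキスト"
CHARSETS = ["euc_jp", "shift_jis", "iso2022_jp", "utf-8"]


def contains_nul(text: str, encoding: str) -> bool:
    """Return True if encoding the text produces any zero byte."""
    return b"\x00" in text.encode(encoding)


for name in CHARSETS:
    print(name, contains_nul(SAMPLE, name))  # False for each charset

# The only way to get a zero byte out of UTF-8 is to encode NUL itself.
print(contains_nul("\x00", "utf-8"))  # True
```

A single sample string only shows that ordinary text does not produce zero bytes; the stronger claim rests on how these encodings are defined.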
So the current (Feb2003 onwards) code should be fine for virtually all charsets - any NULs in the original text are removed, but it's most unlikely they would get in there in the first place.
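In outline, the approach amounts to the following. This is a simplified sketch rather than the actual TWiki.pm code: `render_with_nul_token` is a hypothetical function, and hiding '&' is just a stand-in for whatever the token is really used for.

```python
NUL = "\0"


def render_with_nul_token(text: str) -> str:
    """Sketch: strip stray NULs, then use NUL as the temporary token."""
    text = text.replace(NUL, "")   # any NULs in the original text are removed
    text = text.replace("&", NUL)  # hypothetical use of the token
    # ... other rendering rules would go here ...
    return text.replace(NUL, "&")  # restore the protected characters


print(render_with_nul_token("나라 & słowo"))    # 나라 & słowo
print(render_with_nul_token("na\0me & more"))  # name & more
```

Because the token never appears in the cleaned text, the final replacement can only touch characters the renderer hid on purpose.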
--
RichardDonkin - 05 Sep 2003