By
Stefan Lindmark (stefan.lindmark /at/ sun.com)
International 8-bit ISO8859-1 characters aren't working very well in
WikiWords. The frequency of characters like åäö in Swedish, as well as in other European languages, makes TWiki difficult to use for non-technical users.
I've done a very quick hack just to see how difficult it would be to modify TWiki to work with 8-bit characters in arbitrary places within
WikiWords. It is very rough: I just used sed to change all occurrences of the string "A-Z" to "A-ZÅÄÖÜ" and "a-z" to "a-zåäöü".
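An equivalent Perl one-liner would be something like the following (the file list is illustrative only; the real hack may have touched other files):
<verbatim>
# Widen every "A-Z" / "a-z" range in the TWiki sources, keeping
# .bak backups of the originals. Run from the twiki root directory.
perl -pi.bak -e 's/A-Z/A-ZÅÄÖÜ/g; s/a-z/a-zåäöü/g' bin/* lib/TWiki.pm
</verbatim>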
After this change, I played around for a while and tried different kinds of links, with and without square brackets. 8-bit
WikiWords seem to work in all the places I've put them, so I'm waiting for someone to point out where the flaw in my brute-force approach might be.
Anyway, the change doesn't seem that complex, although the notion of a word character is defined in numerous regexps across the code. Would it be possible to incorporate a change like this into the mainstream branch of the product?
This specific change would certainly benefit Swedish and German users. Other characters may easily be added to enable more languages.
This was tested with TWiki 2001-12-01 running on Apache on Solaris 8. File names seem to be generated correctly.
See also
CaseInsensitiveSearchInternational
--
TWikiGuest - 30 Jan 2002
This feature would also be extremely useful for Russian and other languages with a Cyrillic alphabet.
--
MaximRomashchenko - 21 May 2002
What about Chinese? Or, indeed, languages like Thai that don't have capital letters? I suppose they would use the square bracket notation.
Good idea anyway. Perhaps these characters (ÀÁÂ...) could be defined as capitals by default.
--
MartinEllison - 21 May 2002
One relatively easy way of doing this for all 8-bit languages is to use
[:alnum:] in TWiki code wherever regular expressions such as
[A-Z0-9] are used. However, this would require TWiki to target Perl 5.6.1, and many TWiki installations still use Perl 5.005_03, so perhaps some other approach will be needed.
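A rough illustration of the difference, assuming Perl 5.6.1 and a correctly installed Swedish locale (the locale name varies by system):
<verbatim>
use POSIX qw(locale_h);
use locale;                                  # regexes honour LC_CTYPE
setlocale( LC_CTYPE, 'sv_SE.ISO8859-1' );    # system-dependent name

my $word = "Måndag";
print "locale-aware ok\n" if $word =~ /^[[:upper:]][[:alnum:]]+$/;
print "ascii-only ok\n"   if $word =~ /^[A-Z][A-Za-z0-9]+$/;  # fails on 'å'
</verbatim>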
--
RichardDonkin - 22 May 2002
I'm just wondering if anyone has got TWiki working with Cyrillic? After two days of applying all the advice I found here, I'm asking the TWiki community members who use Cyrillic.
--
EugeneKuzmitsky - 22 May 2002
So, my question was resolved by editing lib/TWiki.pm and changing one line of code. In sub writeHeader we change print $query->header(); to print $query->header(-charset=>'windows-1251');. And it works fine; all forms are now in the proper codepage.
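That is, in lib/TWiki.pm:
<verbatim>
# in sub writeHeader -- before:
print $query->header();
# after:
print $query->header( -charset => 'windows-1251' );
</verbatim>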
--
EugeneKuzmitsky - 24 May 2002
Maybe it deserves to be a new TWiki.cfg parameter? CHARSET?
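Something like this, perhaps (the variable name is just a suggestion, not an existing setting):
<verbatim>
# in TWiki.cfg:
$defaultCharSet = "windows-1251";    # or "iso-8859-1", "windows-1250", ...

# in lib/TWiki.pm, sub writeHeader:
print $query->header( -charset => $defaultCharSet );
</verbatim>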
--
PeterMasiar - 24 May 2002
Anyway, it's much better than making changes for each 'problematic' charset in the Perl modules.
--
EugeneKuzmitsky - 25 May 2002
OK, to add one more thing to this problem: in Hungarian we have the characters áéíóöőúüű and ÁÉÍÓÖŐÚÜŰ, and I am not quite sure whether
[:alnum:] supports them. Maybe there should be two definitions in the config file: one for acceptable capital letters (
[A-Z] by default,
[A-ZÁÉÍÓÖŐÚÜŰ] in my case) and one for non-capital letters, as sketched below. BTW, setting the charset parameter in writeHeader doesn't work if any plugin writes the header, so the charset config parameter is also justified, I think.
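A rough sketch of what I mean (these variables are made up, not existing TWiki code):
<verbatim>
# in the config file:
$upperAlpha = "A-ZÁÉÍÓÖŐÚÜŰ";    # acceptable capital letters
$lowerAlpha = "a-záéíóöőúüű";    # acceptable non-capital letters

# regexes would then be built from these instead of hard-wired ranges:
my $wikiWordRegex =
    qr/[$upperAlpha]+[$lowerAlpha]+[$upperAlpha]+[${upperAlpha}${lowerAlpha}0-9]*/;
</verbatim>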
Actually, I played around with national characters and found that even the best configuration I could set up still didn't work. I used
ActiveState Perl, which supports the Hungarian codepage (Windows-1250), set up the Apache server to use it as the default, and also modified the twiki lib package to use this codepage. I could edit topics with national characters, but I received an error message when I tried to preview or save:
Server error!
Error message:
couldn't create child process: 22: C:/twiki/bin/preview
I can still edit all the topics without national characters. Any suggestions?
--
DavidKosa - 15 Jun 2002
I need to get Swedish characters into
WikiWords, but am a little unsure how to do it. Which file(s) contain the
WikiWord definition? Could anyone help me with the sed command as well?
At my last job we used
PikiPiki (a wiki written in Python), where I added support for Swedish characters by changing the
WikiWord definition. There we had only ONE instance of the definition, which made it very easy to change. But here it seems that every file has its own definitions; wouldn't it be better to have one file defining the
WikiWord and let the other files include it?
--
ErikMattsson - 04 Jul 2002
It looks very much as if it is time for TWiki to switch to Unicode, at least internally (letting its efficacy on browsers depend on the capability of the browser). That would allow pretty well
all of the world's characters to occur. And Unicode does have a standard set of letters -- or should I say syllables -- no, actually I mean symbols used in words. But you'd have to use square brackets for
WikiWords in scripts with no upper/lower case distinction. And the UTF-8 encoding of Unicode is entirely backward-compatible with the 7-bit ASCII encoding that most of us use.
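A quick demonstration of that compatibility (the Encode module ships with Perl 5.8; on older Perls it is on CPAN):
<verbatim>
use Encode qw(encode);
# ASCII characters keep their single-byte values under UTF-8 ...
printf "%vX\n", encode( 'UTF-8', "Wiki" );                  # 57.69.6B.69
# ... while Latin-1 letters like Å ä ö become two bytes each:
printf "%vX\n", encode( 'UTF-8', "\x{C5}\x{E4}\x{F6}" );    # C3.85.C3.A4.C3.B6
</verbatim>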
--
HendrikBoom - 19 Jul 2002
I observed that TWiki will produce increasingly strange file/link names if you enclose a German umlaut (and most
likely any other national special character) in [ [ ... ] ]. An error message would have been OK ("no special characters
allowed in WikiWords ..."). However, I decided to allow umlauts in [ [ ... ] ]; for a (mostly) German site this
seems essential to me. I attached a patch to this page, which fixed the issue for me.
Some theory about the problem with international character sets: in fact, we have three problems. The user types
in some characters on his keyboard, which are stored as 8-bit ASCII (ISO) on the server. This raw data must
be converted into three different forms:
- A legal/portable HTML representation.
- A legal part of a path in a URL (WikiWords only).
- A legal/portable (UNIX) file name (WikiWords only; ever handled a tar archive containing filenames with umlauts in HP's old ROMAN8 encoding? ;-)
The solution for the first problem is clear: HTML character entities are the only way to go (for example, ä
should become &auml;). For the other two cases, solutions have been proposed that encode the characters by
their hexadecimal or octal representation (%F5, =F5 or similar). This looks ugly and is hard for humans to read.
Therefore I suggest transcription for these two cases.
What is transcription? In Germany it often happened that imported typewriters (or computers, or printers ...) could
only produce Latin characters. As a result, there is a well-established method for encoding German umlauts
in plain Latin characters: a German who lacks an ä on his keyboard simply writes ae instead. And that
is exactly what we need for file/path names. My patch does this conversion for German umlauts. The two conversion
tables are coded as hashes, so it should be easy to extend to other languages as needed.
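The attached patch is authoritative, but the idea looks roughly like this (names here are illustrative):
<verbatim>
# one of the two conversion tables, coded as a hash:
my %umlautToAscii = (
    'ä' => 'ae',  'ö' => 'oe',  'ü' => 'ue',
    'Ä' => 'Ae',  'Ö' => 'Oe',  'Ü' => 'Ue',
    'ß' => 'ss',
);

# transcribe a topic name into a safe file/path name:
sub transcribeName {
    my( $name ) = @_;
    my $chars = join( '', keys %umlautToAscii );
    $name =~ s/([$chars])/$umlautToAscii{$1}/g;
    return $name;
}
# transcribeName( "GrößenÄnderung" ) yields "GroessenAenderung"
</verbatim>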
--
HaraldWilhelmi - 14 Aug 2002
I found a minor typo in my translation table and updated my patch.
--
HaraldWilhelmi - 21 Aug 2002
I didn't quite understand your proposal for transcription initially, but it seems you are only looking at this for URLs and filenames. My only concern is that transcriptions may not be available for some languages, and the scheme is language-specific. So it might be better to just live with ugly-looking URLs - filenames are invisible to the user anyway.
All this would require Perl 5.6.1, which has good support for internationalised character classes for use in regexes. This should probably be configurable, since many webhosts and older Linux distros still ship Perl 5.005_03.
It would be best if the same character set is used server-wide, so that all tools called by TWiki use the same definition of 'letter' (e.g.
grep as used by the Search feature), and for consistency when creating filenames that must match a
WikiWord in a TWiki page. Presumably URLs including high-8-bit characters must be URL-encoded?
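For example, with the standard URI::Escape module (the topic name is made up):
<verbatim>
use URI::Escape qw(uri_escape);
# "MånadsMöte" in ISO-8859-1 becomes M%E5nadsM%F6te in a URL:
print uri_escape( "M\xE5nadsM\xF6te" ), "\n";
</verbatim>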
--
RichardDonkin - 21 Aug 2002
Harald, I did the same thing with the
AtisWiki I use on a free site. I stopped using my experimental TWiki because I couldn't modify it to handle Hungarian characters. I hope your patch works. Thank you!
--
DavidKosa - 21 Nov 2002
I'm doing some work on TWiki internationalisation, mainly for 8-bit character support in
WikiWords and for searching. So watch this space.

Comments on the highest priority features are welcome - my current plan is to make this work on any Perl 5.005_03 or higher, and have a workaround for broken locale support (e.g.
CygWin) as well as a nicer approach for those whose locales are set correctly. The main issue is the URL encoding and decoding, and maintaining two versions of the topic name as a result, which could be a fair amount of work.
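The locale side might look roughly like this (a sketch only, not the final code; the hard-wired fallback also keeps Perl 5.005_03 working, since POSIX character classes need Perl 5.6):
<verbatim>
use POSIX qw(locale_h);
use locale;

my $locale = setlocale( LC_CTYPE );      # e.g. 'sv_SE.ISO8859-1'

# Use locale-aware classes where the locale is set correctly,
# otherwise fall back to a hard-wired list (workaround for broken
# locale support, e.g. on CygWin):
my $upper = ( $locale && $locale ne 'C' && $locale ne 'POSIX' )
          ? '[:upper:]'
          : 'A-ZÅÄÖ';
my $wikiWordStart = qr/[$upper]/;
</verbatim>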
I'm not planning on using transcription as it is different for every language - URL encoding results in somewhat ugly URLs, but I assume this is OK as long as the displayed page name and the
WikiWords are using full 8-bit characters. Hope that's OK...
--
RichardDonkin - 22 Nov 2002
Now implemented in
TWikiAlphaRelease - please try this out. See
InternationalisationEnhancements for details and important browser setup, and
http://donkin.org/bin/view/Test/TestTopic5
for a test page running this code.
--
RichardDonkin - 30 Nov 2002