By
Stefan Lindmark (stefan.lindmark /at/ sun.com)
International 8-bit ISO8859-1 characters aren't working very well in
WikiWords. The frequency of characters like åäö in Swedish, as well as in other European languages, makes TWiki difficult to use for non-technical users.
I've done a very quick hack just to see how difficult it would be to modify TWiki to work with 8-bit characters in arbitrary places within
WikiWords. It is very rough: I just used sed to change all occurrences of the string "A-Z" to "A-ZÅÄÖÜ" and "a-z" to "a-zåäöü".
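An equivalent Perl one-liner would be something like the following (the file list is illustrative only; the real hack may have touched other files):
<verbatim>
# Widen every "A-Z" / "a-z" range in the TWiki sources, keeping
# .bak backups of the originals. Run from the twiki root directory.
perl -pi.bak -e 's/A-Z/A-ZÅÄÖÜ/g; s/a-z/a-zåäöü/g' bin/* lib/TWiki.pm
</verbatim>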
After this change, I played around for a while and tried different kinds of links, with and without square brackets. 8-bit
WikiWords seem to work in all the places I've put them, so I'm waiting for someone to point out where the flaw in my brute-force approach might be.
Anyway, the change doesn't seem that complex, although the notion of a word character is defined in numerous regexps across the code. Would it be possible to incorporate a change like this into the mainstream branch of the product?
This specific change would certainly benefit Swedish and German users. Other characters may easily be added to enable more languages.
This was tested with TWiki 2001-12-01 running on Apache on Solaris 8. File names seem to be generated correctly.
See also
CaseInsensitiveSearchInternational
--
TWikiGuest - 30 Jan 2002
This feature would also be extremely useful for Russian and other languages with a Cyrillic alphabet.
--
MaximRomashchenko - 21 May 2002
What about Chinese? Or, indeed, languages like Thai that don't have capital letters? I suppose they would use the square bracket notation.
Good idea anyway. Perhaps these characters (ÀÁÂ...) could be defined as capitals by default.
--
MartinEllison - 21 May 2002
One relatively easy way of doing this for all 8-bit languages is to use
[:alnum:] in TWiki code wherever regular expressions such as
[A-Z0-9] are used. However, this would require TWiki to target Perl 5.6.1, and many TWiki installations still use Perl 5.005_03, so perhaps some other approach will be needed.
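A rough illustration of the difference, assuming Perl 5.6.1 and a correctly installed Swedish locale (the locale name varies by system):
<verbatim>
use POSIX qw(locale_h);
use locale;                                  # regexes honour LC_CTYPE
setlocale( LC_CTYPE, 'sv_SE.ISO8859-1' );    # system-dependent name

my $word = "Måndag";
print "locale-aware ok\n" if $word =~ /^[[:upper:]][[:alnum:]]+$/;
print "ascii-only ok\n"   if $word =~ /^[A-Z][A-Za-z0-9]+$/;  # fails on 'å'
</verbatim>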
--
RichardDonkin - 22 May 2002
I'm just wondering if anyone has got TWiki working with Cyrillic? After two days of applying all the advice I found here, I'm asking the TWiki community members who use Cyrillic.
--
EugeneKuzmitsky - 22 May 2002
So, my question was resolved by editing lib/TWiki.pm and changing one line of code. In sub writeHeader we change print $query->header(); to print $query->header(-charset=>'windows-1251');. And it works fine; all forms are now in the proper codepage.
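That is, in lib/TWiki.pm:
<verbatim>
# in sub writeHeader -- before:
print $query->header();
# after:
print $query->header( -charset => 'windows-1251' );
</verbatim>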
--
EugeneKuzmitsky - 24 May 2002
Maybe it deserves to be a new TWiki.cfg parameter? CHARSET?
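Something like this, perhaps (the variable name is just a suggestion, not an existing setting):
<verbatim>
# in TWiki.cfg:
$defaultCharSet = "windows-1251";    # or "iso-8859-1", "windows-1250", ...

# in lib/TWiki.pm, sub writeHeader:
print $query->header( -charset => $defaultCharSet );
</verbatim>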
--
PeterMasiar - 24 May 2002
Anyway, it's much better than making changes for each 'problematic' charset in the Perl modules.
--
EugeneKuzmitsky - 25 May 2002
OK, to add one more thing to this problem: in Hungarian we have the characters áéíóöőúüű and ÁÉÍÓÖŐÚÜŰ, and I am not quite sure whether
[:alnum:] supports them. Maybe there should be two definitions in the config file: one for acceptable capital letters (
[A-Z] by default,
[A-ZÁÉÍÓÖŐÚÜŰ] in my case) and one for non-capital letters, as sketched below. BTW, setting the charset parameter in writeHeader doesn't work if any plugin writes the header, so the charset config parameter is also justified, I think.
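A rough sketch of what I mean (these variables are made up, not existing TWiki code):
<verbatim>
# in the config file:
$upperAlpha = "A-ZÁÉÍÓÖŐÚÜŰ";    # acceptable capital letters
$lowerAlpha = "a-záéíóöőúüű";    # acceptable non-capital letters

# regexes would then be built from these instead of hard-wired ranges:
my $wikiWordRegex =
    qr/[$upperAlpha]+[$lowerAlpha]+[$upperAlpha]+[${upperAlpha}${lowerAlpha}0-9]*/;
</verbatim>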
Actually, I played around with national characters and found that even the best configuration I could set up still didn't work. I used
ActiveState Perl, which supports the Hungarian codepage (Windows-1250), set up the Apache server to use it as the default, and also modified the twiki lib package to use this codepage. I could edit topics with national characters, but I received an error message when I tried to preview or save:
Server error!
Error message:
couldn't create child process: 22: C:/twiki/bin/preview
I can still edit all the topics without national characters. Any suggestions?
--
DavidKosa - 15 Jun 2002
I need to get Swedish characters into
WikiWords, but am a little unsure how to do it. Which file(s) contain the
WikiWord definition? Could anyone help me with the sed command as well?
At my last job we used
PikiPiki (a wiki written in Python), where I added support for Swedish characters by changing the
WikiWord definition. There we had only ONE instance of the definition, which made it very easy to change. But here it seems that every file has its own definitions; wouldn't it be better to have one file defining the
WikiWord and let the other files include it?
--
ErikMattsson - 04 Jul 2002
It looks very much as if it is time for TWiki to switch to Unicode, at least internally (letting its efficacy on browsers depend on the capability of the browser). That would allow pretty well
all of the world's characters to occur. And Unicode does have a standard set of letters -- or should I say syllables -- no, actually I mean symbols used in words. But you'd have to use square brackets for
WikiWords in scripts with no upper/lower case distinction. And the UTF-8 encoding of Unicode is entirely backward-compatible with the 7-bit ASCII encoding that most of us use.
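A quick demonstration of that compatibility (the Encode module ships with Perl 5.8; on older Perls it is on CPAN):
<verbatim>
use Encode qw(encode);
# ASCII characters keep their single-byte values under UTF-8 ...
printf "%vX\n", encode( 'UTF-8', "Wiki" );                  # 57.69.6B.69
# ... while Latin-1 letters like Å ä ö become two bytes each:
printf "%vX\n", encode( 'UTF-8', "\x{C5}\x{E4}\x{F6}" );    # C3.85.C3.A4.C3.B6
</verbatim>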
--
HendrikBoom - 19 Jul 2002
I observed that TWiki will produce increasingly strange file/link names if you enclose a German umlaut (and most
likely any other national special character) in [ [ ... ] ]. An error message would have been OK ("no special characters
allowed in WikiWords ..."). However, I decided to allow umlauts in [ [ ... ] ]; for a (mostly) German site this
seems essential to me. I attached a patch to this page, which fixed the issue for me.
Some theory about the problem with international character sets: in fact, we have three problems. The user types
in some characters on his keyboard, which are stored as 8-bit ASCII (ISO) on the server. This raw data must
be converted into three different forms:
- A legal/portable HTML representation.
- A legal part of a path in a URL (WikiWords only).
- A legal/portable (UNIX) file name (WikiWords only; ever handled a tar archive containing filenames with umlauts in HP's old ROMAN8 encoding? ;-)
The solution for the first problem is clear: HTML character entities are the only way to go (for example, ä
should become &auml;). For the other two cases, solutions have been proposed that encode the characters by
their hexadecimal or octal representation (%F5, =F5 or similar). This looks ugly and is hard for humans to read.
Therefore I suggest transcription for these two cases.
What is transcription? In Germany it often happened that imported typewriters (or computers, or printers ...) could
only produce Latin characters. As a result, there is a well-established method for encoding German umlauts
in plain Latin characters: a German who lacks an ä on his keyboard simply writes ae instead. And that
is exactly what we need for file/path names. My patch does this conversion for German umlauts. The two conversion
tables are coded as hashes, so it should be easy to extend to other languages as needed.
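The attached patch is authoritative, but the idea looks roughly like this (names here are illustrative):
<verbatim>
# one of the two conversion tables, coded as a hash:
my %umlautToAscii = (
    'ä' => 'ae',  'ö' => 'oe',  'ü' => 'ue',
    'Ä' => 'Ae',  'Ö' => 'Oe',  'Ü' => 'Ue',
    'ß' => 'ss',
);

# transcribe a topic name into a safe file/path name:
sub transcribeName {
    my( $name ) = @_;
    my $chars = join( '', keys %umlautToAscii );
    $name =~ s/([$chars])/$umlautToAscii{$1}/g;
    return $name;
}
# transcribeName( "GrößenÄnderung" ) yields "GroessenAenderung"
</verbatim>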
--
HaraldWilhelmi - 14 Aug 2002
I found a minor typo in my translation table and updated my patch.
--
HaraldWilhelmi - 21 Aug 2002
I didn't quite understand your proposal for transcription initially, but it seems you are only looking at this for URLs and filenames. My only concern is that transcriptions may not be available for some languages, and the scheme is language-specific. So it might be better to just live with ugly-looking URLs - filenames are invisible to the user anyway.
All this would require Perl 5.6.1, which has good support for internationalised character classes for use in regexes. This should probably be configurable, since many webhosts and older Linux distros still ship Perl 5.005_03.
It would be best if the same character set is used server-wide, so that all tools called by TWiki use the same definition of 'letter' (e.g.
grep as used by the Search feature), and for consistency when creating filenames that must match a
WikiWord in a TWiki page. Presumably URLs including high-8-bit characters must be URL-encoded?
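For example, with the standard URI::Escape module (the topic name is made up):
<verbatim>
use URI::Escape qw(uri_escape);
# "MånadsMöte" in ISO-8859-1 becomes M%E5nadsM%F6te in a URL:
print uri_escape( "M\xE5nadsM\xF6te" ), "\n";
</verbatim>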
--
RichardDonkin - 21 Aug 2002
Harald, I did the same thing with the
AtisWiki I use on a free site. I stopped using my experimental TWiki because I couldn't modify it to handle Hungarian characters. I hope your patch works. Thank you!
--
DavidKosa - 21 Nov 2002
I'm doing some work on TWiki internationalisation, mainly for 8-bit character support in
WikiWords and for searching. So watch this space.

Comments on the highest priority features are welcome - my current plan is to make this work on any Perl 5.005_03 or higher, and have a workaround for broken locale support (e.g.
CygWin) as well as a nicer approach for those whose locales are set correctly. The main issue is the URL encoding and decoding, and maintaining two versions of the topic name as a result, which could be a fair amount of work.
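The locale side might look roughly like this (a sketch only, not the final code; the hard-wired fallback also keeps Perl 5.005_03 working, since POSIX character classes need Perl 5.6):
<verbatim>
use POSIX qw(locale_h);
use locale;

my $locale = setlocale( LC_CTYPE );      # e.g. 'sv_SE.ISO8859-1'

# Use locale-aware classes where the locale is set correctly,
# otherwise fall back to a hard-wired list (workaround for broken
# locale support, e.g. on CygWin):
my $upper = ( $locale && $locale ne 'C' && $locale ne 'POSIX' )
          ? '[:upper:]'
          : 'A-ZÅÄÖ';
my $wikiWordStart = qr/[$upper]/;
</verbatim>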
I'm not planning on using transcription as it is different for every language - URL encoding results in somewhat ugly URLs, but I assume this is OK as long as the displayed page name and the
WikiWords are using full 8-bit characters. Hope that's OK...
--
RichardDonkin - 22 Nov 2002
Now implemented in
TWikiAlphaRelease - please try this out. See
InternationalisationEnhancements for details and important browser setup, and
http://donkin.org/bin/view/Test/TestTopic5
for a test page running this code.
--
RichardDonkin - 30 Nov 2002