Question
I found this bug recently. Whenever I try to create a new wiki page whose name is an odd number of Chinese characters enclosed in double brackets, for example 同学录, TWiki displays a wrongly encoded page, something like this:
闅忔墜璐PATH_TRANSLATED=c:\easyphp\www\personalarea\闅忔墜璐 (edit)
I was confused for a while because some Chinese wiki links work and others do not. Finally, I figured out that I have to use an even number of Chinese characters to make it work.
Weird enough, isn't it?
- NOTE: Problem was in fact use of GB2312 encoding, and not related to use of odd vs. even Chinese character strings -- RD
Environment
--
ChunhuaLiao - 26 Oct 2004
Answer
Interesting - can you post details of your config as per SupportGuidelines, ideally with attachment of TWiki.cfg and your testenv HTML output, or a URL to your testenv?
From your comments on RegisterFailureInsecureDependency, I would guess you are using a $siteLocale of zh_CN.gb2312 - there may be a problem with specific characters in that encoding if the first or second byte of the character can be interpreted as a 7-bit ASCII character, though I thought gb2312 was 'ASCII safe'. If you can provide an attachment of the exact topic .txt file that shows the error, along with some test cases that work, I should be able to have a look at this.
I had a quick look in the CJKV book (highly recommended) and it appears that GBK, which is quite closely related to GB2312, is not ASCII-safe, since the second byte in GBK characters starts at 40 hex (64 decimal), which overlaps with ASCII. I'll try to confirm this for GB2312 when I get time, but if you can send me the text that causes the problem, as an attachment here, that would help me confirm what's going on (the Chinese characters above have been turned into Unicode Numeric Character References since TWiki.org does not support GB2312).
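The overlap between GBK second bytes and ASCII can be checked directly. The following is a minimal Python sketch, not TWiki code: the helper name `gbk_chars_with_trail_byte` is made up for illustration, and the result depends on the mapping table in Python's built-in `gbk` codec. Note that Python's `gb2312` codec emits the EUC-CN byte form.

```python
def gbk_chars_with_trail_byte(trail):
    """Return the GBK two-byte characters whose second byte equals `trail`."""
    chars = []
    for lead in range(0x81, 0xFF):  # GBK lead bytes run from 0x81 to 0xFE
        try:
            chars.append(bytes([lead, trail]).decode("gbk"))
        except UnicodeDecodeError:
            pass  # this byte pair is not assigned in Python's GBK table
    return chars


# 0x5D is ASCII ']', which closes a [[...]] link in TWiki syntax.
risky = gbk_chars_with_trail_byte(0x5D)
print(len(risky), "GBK characters have ']' as their second byte")

# For comparison: the page name from the question in EUC-CN form
# (what Python's "gb2312" codec produces) contains no 7-bit bytes.
print(all(b >= 0x80 for b in "同学录".encode("gb2312")))
```

Passing other values such as 0x5B (`[`) or 0x7C (`|`) shows which characters collide with other markup bytes.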
So, my initial thoughts are that you need to use an ASCII-safe Chinese encoding - EUC-CN would definitely work if you can handle that in your environment. Alternatively, you could try using UTF-8, which can be used today in a similar way, although it is not handled specially (i.e. sorting and some other things won't work too well - see SupportForUTF8).
WikiWords would not work with either EUC-CN or UTF-8 but that doesn't matter for Chinese support.
Let me know how you get on - it's good to have direct contact with Chinese users since I largely developed the I18N feature without feedback from East Asian users.
--
RichardDonkin - 26 Oct 2004
I just tried another instance of TWiki of mine, running on Gentoo Linux. It works fine with wiki links that have an odd number of Chinese characters. I am using the same zh_CN.gb2312 encoding for both of them.
The TWiki in question is installed on my Windows PC with Cygwin and the Apache bundled with EasyPHP. I am looking into the reason and will keep you updated.
--
ChunhuaLiao - 26 Oct 2004
Locales don't work on Cygwin but you have probably configured around this using 'non-locale regex' mode. If you could provide a test case file (topic.txt), TWiki.cfg and testenv, as mentioned above, that would help in diagnosing this problem. I suspect it may not be just the odd or even number of Chinese characters causing this, but I need specific examples in GB2312 encoding, not modified by insertion into the TWiki.org page, to be sure. A .txt attachment would be the best way around this, since the text 同学录 inserted above is actually stored as the numeric character references &#21516;&#23398;&#24405; here at TWiki.org.
I'd also like to know which browser you are using. I have just tested creating a TWiki page with three GB2312 characters as the name - see links from my Chinese test page. My 'testbin' TWiki at that URL is now running in GB2312 and works OK with the text I have used - please supply some text for the link (using GB2312) that will make it break!
--
RichardDonkin - 26 Oct 2004
You are right. It has nothing to do with the number of Chinese characters. I have attached the files you requested and screenshots of the errors. I hope they are helpful.
By the way, the attachment table on this page only shows my last attachment, and I don't know why. You have to go to the revision information of the attachment table to get all the files I tried to upload. That is another thing I find strange.
--
ChunhuaLiao - 26 Oct 2004
Thanks for the extra info. I think what happened with the attachments is that you uploaded all the files 'on top of' one file. That's why they all download as plain text, but they're still there anyway so no real problem. Using the 'Attach image or document' link for each new attachment would avoid this.
I've done some more reading - neither GBK nor GB2312-80 is ASCII-safe, so certain two-byte Chinese characters will include 7-bit ASCII bytes that get confused with other TWiki characters (e.g. space, ], etc.).
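To illustrate the kind of confusion involved, here is a small Python sketch. The regex is a deliberately simplified, hypothetical stand-in for link parsing on raw bytes; it is not TWiki's actual code. The test character is built from the GBK bytes 0x81 0x5D, chosen only because its second byte is ASCII `]`.

```python
import re

# Hypothetical, simplified byte-level matcher for [[...]] links.
LINK_RE = re.compile(rb"\[\[([^\]]+)\]\]")


def extract_link_name(raw):
    """Return the bytes captured between [[ and ]], or None if no link matches."""
    match = LINK_RE.search(raw)
    return match.group(1) if match else None


# A GBK character whose second byte is 0x5D, i.e. ASCII ']'.
ch = bytes([0x81, 0x5D]).decode("gbk")
link = "[[" + ch + "]]"

gbk_name = extract_link_name(link.encode("gbk"))
utf8_name = extract_link_name(link.encode("utf-8"))

# In GBK the captured name stops after the lead byte, so it is not the full character.
print(gbk_name == ch.encode("gbk"))
# In UTF-8 every byte of a non-ASCII character is >= 0x80, so the name survives intact.
print(utf8_name == ch.encode("utf-8"))
```

A byte-oriented parser sees the character's second byte as the start of the closing brackets, so the page name is cut in the middle of a character and the remainder is treated as ordinary text.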
The only solution is to use a safe character encoding - EUC-CN will work fine, and can encode GB2312 without problems. Alternatively, you could just use UTF-8, though be sure to remove the writeWarning in SVNget:lib/TWiki.pm, where use of UTF-8 character encoding in $siteCharset is checked; otherwise it fills up the warning.txt file. The line to comment out is:
writeWarning "UTF-8 not yet supported as site charset - TWiki is likely to have problems";
WikiWords will work for English text only using UTF-8 at present, but that shouldn't matter. I know a Japanese site that was using UTF-8 for a long time so most things should work, but the UTF-8 code is not really finished - so EUC-CN is likely to work more smoothly.
I'll update the docs, including InternationalisationIssues and JapaneseAndChineseSupport, and modify the code to put GB2312 and others on the excluded list. Unfortunately most commonly used CJKV character sets are not in fact safe to use with TWiki, apart from EUC-CN, EUC-TW, EUC-JP, EUC-KR and UTF-8, because of this 'second byte ASCII' issue. Sorry for not spotting this earlier - I did do a lot of searching for 'ASCII safe' character sets once I realised a similar issue with ISO-2022-JP, but it's hard to find this information on the Web.
Also, I'll rename this topic to SomeChineseCharactersBreakWikiLinks.
--
RichardDonkin - 27 Oct 2004
Now fixed in SVN for DakarRelease, see Codev.SomeChineseCharactersBreakWikiLinks.
--
RichardDonkin - 28 Oct 2004
I am staying with GB2312 to keep the existing Chinese characters scattered through my pages readable. Using only English WikiWords is no problem for my everyday usage.
--
ChunhuaLiao - 02 Nov 2004
You can just do a bulk conversion from GB2312 to UTF-8, and your characters will still be readable - lots of Chinese sites use UTF-8 now. This issue is not just with WikiWord links - it can also affect any Chinese character whose encoding includes an ASCII byte that is used in TWiki syntax. It is hard to predict exactly what will happen, which is why it's easier to use UTF-8.
--
RichardDonkin - 11 Apr 2006
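A bulk conversion of the kind suggested above could look roughly like the following Python sketch. It is an illustration under assumptions, not a TWiki tool: the function name `convert_topics` is hypothetical, it assumes every `.txt` file under the given directory is currently GB2312 text, and it leaves any other files (such as revision history) untouched. It should only be run on a backup copy of the data.

```python
from pathlib import Path


def convert_topics(data_dir, src_encoding="gb2312", dst_encoding="utf-8"):
    """Re-encode every .txt file under data_dir in place; return the converted paths."""
    decoded = {}
    # First pass: decode everything, so a bad file raises UnicodeDecodeError
    # before any file has been rewritten.
    for path in sorted(Path(data_dir).rglob("*.txt")):
        decoded[path] = path.read_bytes().decode(src_encoding)
    # Second pass: write the text back in the target encoding.
    for path, text in decoded.items():
        path.write_bytes(text.encode(dst_encoding))
    return list(decoded)
```

If the pages were actually saved as GBK on Windows, passing `src_encoding="gbk"` may be needed instead, since Python's `gb2312` codec rejects characters outside the GB2312 set.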