Question
Summary: Some versions of CPAN:CGI
(CGI.pm) cause problems on Windows with Cygwin Perl, though ActiveState Perl seems to be OK. Not a TWiki bug but we need to document which versions of CGI.pm cause this issue. -- RD
As per request in
GermanUmlauteAndWindows, I moved this into a separate topic.
The following error shows on both Windows Server 2003 and Windows XP. Each installation was done by following the instructions described in the
WindowsInstallCookbook. The latest Cygwin/Perl and the latest Apache in the 1.3xx-branch are being used. All necessary Perl-modules as per output of
testenv have been installed. UTF-8-encoding of urls has been enabled in all browsers. Settings in
TWiki.cfg and output of
testenv can be checked in these attached files:
Error: When clicking on a non-existent
WikiWord that contains umlauts, the following edit-screen shows other characters than the umlauts. Saving the topic results in a "NOTE: This Wiki topic does not exist yet"-page, again with altered characters. See the following screenshots:
- If you click on the highlighted WikiWord...
- ...you get this on the edit-screen.
- URL is displayed correctly, though.
- After saving, you get this:
I've tried all possible and impossible combinations of TWiki configuration variables to no avail. I also tried to debug the code in TWiki.pm, edit and view, mainly the UTF-8 implementation. It all seemed to work OK.
Finally I've found something in the bug section of CPAN. It looks like CGI.pm has trouble with UTF-8. Read more about that here. Could this be the cause of all the trouble? And what could be a workaround?
Environment
--
JoachimBlum - 23 Oct 2005
Answer
Hello? Suggestions? Anyone?
Heeeelp!
--
JoachimBlum - 02 Nov 2005
Thanks for providing a complete set of relevant info. First of all, have you applied the two patches linked from
InternationalisationIssues (last bullet in the fixed issues for 01Sep2004 list)? Without that, TWiki
I18N is quite broken.
To other CoreTeam members: any chance of a maintenance release to Sep2004 code so that at least basic
I18N has a chance of working without patches?
TWiki does its own UTF-8 URL encoding so I'd be surprised if a CGI.pm bug is an issue here. However, recent versions of CGI.pm may have broken things - I use CGI.pm v3.04 OK on Linux, so perhaps you could downgrade to that.
You might want to investigate the Apache request log, and check whether you have any proxies between TWiki and browser (probably not on the XP box!). Also, the suggestions and patches in
GermanUmlauteOnWindows will help in debugging what's going on here since you're OK with looking at the Perl code. If the UTF-8 URL decoding support (
EncodeURLsWithUTF8) is not working for some reason, it should be evident from suitable
writeDebug statements.
One other idea: since you already have the Unicode::MapUTF8 modules installed, you could try tweaking the version check code in the UTF-8 URL encoding routine so that you use this module - just using the Perl 5.6 part of the conversion should be OK.
You could also try using ISO-8859-1 as the character set - would lose the Euro but doesn't require any conversion modules and is a useful debugging step.
--
RichardDonkin - 05 Nov 2005
Good news: Finally I got it to work. Two things did the trick:
- Changed the Perl-interpreter from Cygwin 5.8.7 to ActivePerl 5.8.7
- Patched the $regex{validUtf8StringRegex} in TWiki.pm, line 661 as described below
Here's the patch that I applied to $regex{validUtf8StringRegex} in line 661 of TWiki.pm:
- Old:
qr/^ (?: $regex{validUtf8CharRegex} ) $/x
- New:
qr/^ (?: $regex{validUtf8CharRegex} )+ $/x (Note the added '+' between ')' and ' $').
The regex wouldn't work in its original form. No utf-8-compliant url would be recognized and so
convertUtf8URLtoSiteCharset() would always fail. I don't know regexes well enough to say for certain why this happens. Just a guess: without a quantifier, the group '(?: ... )' matches exactly one character, so any string longer than one character would not be matched. The '+' quantifier, on the other hand, says "match 1 or more times", so it extends the scope of the regex to the whole string. This was one killer.
The other one was Cygwin Perl, which doesn't fully support locales. UTF-8-encoded URLs would appear wrong in scripts. For instance, the test case TestUmläute would appear encoded as TestUml\xC7\xB3ute, which is not the UTF-8 encoding of 'ä'. The correct encoding is TestUml\xC3\xA4ute, which is exactly what ActivePerl delivers.
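To illustrate the effect of the '+' quantifier, here is a minimal sketch. The character regex is a simplified stand-in chosen for this example (an assumption), not the actual definition of $regex{validUtf8CharRegex} in TWiki.pm, and matches() is a hypothetical helper.

```perl
use strict;
use warnings;

# Simplified stand-in for $regex{validUtf8CharRegex} (assumption, not the
# TWiki definition): one ASCII byte, or a 2-byte or 3-byte UTF-8 sequence.
my $utf8_char = qr/ [\x00-\x7F]
                  | [\xC2-\xDF][\x80-\xBF]
                  | [\xE0-\xEF][\x80-\xBF]{2} /x;

# In (?: ... ) the "?:" only marks a non-capturing group; it is not a quantifier.
my $one_char  = qr/^ (?: $utf8_char ) $/x;     # exactly one character
my $any_chars = qr/^ (?: $utf8_char )+ $/x;    # one or more characters

# Hypothetical helper: returns 1 if the string matches the regex, else 0.
sub matches {
    my ($re, $s) = @_;
    return $s =~ $re ? 1 : 0;
}

my $url_bytes = "TestUml\xC3\xA4ute";
printf "without '+': %d\n", matches($one_char,  $url_bytes);
printf "with '+':    %d\n", matches($any_chars, $url_bytes);
```

With this stand-in, the version without '+' only accepts a string consisting of a single character, so the test URL fails; the version with '+' accepts the whole string.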
Could someone please verify my findings?
Status of this question changed to
answered.
--
JoachimBlum - 08 Nov 2005
Not sure why changing from Cygwin would have an effect - perhaps because of the associated
CPAN:Encode
version, which is what would be used with Perl 5.8.
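For debugging outside TWiki, the conversion step can be tried in isolation with the Encode module. This is only a sketch: utf8_to_site_charset() is a hypothetical helper written for this example, not TWiki's convertUtf8URLtoSiteCharset(), and ISO-8859-1 is assumed as the site charset.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not TWiki code): convert UTF-8 octets to the given
# site charset. Returns undef if the octets are malformed UTF-8 or if a
# character cannot be represented in the target charset.
sub utf8_to_site_charset {
    my ($octets, $charset) = @_;
    my $chars = eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK) };
    return undef unless defined $chars;
    my $out = eval { Encode::encode($charset, $chars, Encode::FB_CROAK) };
    return $out;
}

# Compare the byte sequence ActivePerl delivered with the one seen under Cygwin.
for my $bytes ("TestUml\xC3\xA4ute", "TestUml\xC7\xB3ute") {
    my $converted = utf8_to_site_charset($bytes, 'ISO-8859-1');
    print defined $converted ? "converted OK\n" : "conversion failed\n";
}
```

The first string converts cleanly. The second fails because \xC7\xB3 decodes to a character that ISO-8859-1 cannot represent, which is consistent with the broken topic names described in the question.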
Please submit a
Codev.BugReport that points to this topic - your patches are important and I'd like to get them into
DakarRelease.
--
RichardDonkin - 09 Nov 2005
I don't think CPAN:Encode is at fault; I think it's CPAN:CGI, because the wrongly encoded TestUml\xC7\xB3ute comes from the $thePathInfo = $cgi->path_info(); call in view/edit/.... As soon as I changed to ActivePerl, the URL appeared correctly encoded.
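One way to make such byte-level differences visible is to print the path with every non-ASCII byte escaped. The helper below is a hypothetical debugging aid, not part of TWiki; in a real test it could be applied both to the value returned by $cgi->path_info() and to $ENV{PATH_INFO}, to see whether the web server or CGI.pm delivers the unexpected bytes.

```perl
use strict;
use warnings;

# Hypothetical debugging helper: render every byte outside printable ASCII
# as \xNN so that encoding differences show up in a log or plain-text page.
sub show_bytes {
    my ($s) = @_;
    return '(undef)' unless defined $s;
    $s =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord($1))/ge;
    return $s;
}

# Example with the correctly encoded test case from this topic.
print show_bytes("/Main/TestUml\xC3\xA4ute"), "\n";
# PATH_INFO as passed by the web server (undefined outside a CGI request).
print show_bytes($ENV{PATH_INFO}), "\n";
```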
BugReport has been sent.
--
JoachimBlum - 09 Nov 2005
Did you try upgrading or downgrading the
CPAN:CGI
module? CGI.pm is pure Perl so you should be able to just copy the ActivePerl CGI.pm into the Cygwin Perl library path (twiki
lib directory should work for testing purposes).
Could you provide the CGI.pm versions you use on the two Perl variants? Doing the following should work in both Cygwin bash and Windows cmd.exe (for ActivePerl):
perl -MCGI -e "print CGI->VERSION"
By the way, the regex you showed as 'new' above is the same as the version I have in the 02Sep2004 code - perhaps the '+' was deleted by mistake, but it is included in the TWiki release, and if omitted would have broken all UTF-8 URLs of course.
--
RichardDonkin - 11 Nov 2005
OK, I have to make an apology here. I was a victim of my own desperation. Of course, the regex in the current 04Sep2004 code (which I'm using) is correct. I deleted the '+' myself when I temporarily changed the regex to always match in order to test
convertUtf8URLtoSiteCharset(). When I changed it back, somehow the '+' didn't make it back. Sorry for that.
That leaves us with the error of wrongly encoded URLs. The CGI.pm version is 3.10 on both Cygwin and ActiveState. I haven't tried upgrading or downgrading yet; maybe that's an option, although I guess I'll stay with ActiveState anyway because it performs better.
Again, sorry for my dumbness and all the fuss it created.
--
JoachimBlum - 14 Nov 2005
Hi Joachim - easy mistake to make. I try to always copy and comment out lines like that, especially when not using a version control system (you can of course just use
RCS for simple version control, just type
ci -l to checkin and lock a Perl file).
It would be good to know which CGI.pm version broke things, particularly on the Cygwin side, so if you get time to try an older version or two on Cygwin that would help other people avoid this problem. If you are short of time, perhaps you could just try CGI.pm version 3.04, which I know works on Linux.
For now, I'll link to this from
InternationalisationIssues.
--
RichardDonkin - 15 Nov 2005