
Bug: Apache 2.0 Breaks Non-UTF-8 Encoded URLs on Windows

Summary

When using Firefox or similar browsers that don't send UTF-8 URLs by default, international characters in WikiWords (as per InternationalisationEnhancements) don't work in Apache 2.0.52 on Windows.

Partially fixed in Apache 2.0.54 for Windows, but problems still occur (see the latest comments). The bug does not occur on other platforms.

Details

The problem is that whenever Apache sees non-UTF-8 URLs (e.g. ISO-8859-1 URL-encoded with % escapes), it tries to convert them from UTF-8 to UCS-2 (a two-byte Unicode format) before passing them to TWiki. The conversion attempt includes the PATH_INFO (e.g. Codev/ThisTopic) that TWiki uses as the name of the topic, and it fails because the URL is not valid UTF-8.

The result is that the TWiki code never sees the encoded URL and the page is inaccessible. The server gives a '500 internal server error' message and Apache error.log has this line for an ISO-8859-1 test case:

(22)Invalid argument: utf8 to ucs2 conversion failed on this string: PATH_INFO=/Main/FromageD\xe9rap\xe9

This conversion is driven by new support for Windows Unicode filesystem APIs in Apache 2.0, and support for UTF-8 URLs (IRIs) on Windows, even though PATH_INFO at this point has nothing to do with the filesystem. (This is probably an Apache bug, since there's no obvious way to turn this off - I have not yet checked if already reported but there are some similar possible bugs, e.g. ApacheBug:9223, ApacheBug:13029, ApacheBug:18805 and ApacheBug:20855, that result from Apache 2 assuming strings are UTF-8.)

-- RichardDonkin - 09 Dec 2004
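
To illustrate why the conversion in the error log fails, here is a minimal Python sketch (not Apache's code; the helper name is_valid_utf8 is made up for this example). It takes the PATH_INFO bytes from the log line above and checks whether they decode as strict UTF-8, which is the precondition for a UTF-8 to UCS-2 conversion:

```python
# PATH_INFO bytes from the error log: ISO-8859-1 encoding of "FromageDérapé".
latin1_path = b"/Main/FromageD\xe9rap\xe9"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same topic name re-encoded as UTF-8, as a UTF-8 URL would carry it.
utf8_path = latin1_path.decode("iso-8859-1").encode("utf-8")

print(is_valid_utf8(latin1_path))  # False: 0xE9 is not followed by continuation bytes
print(is_valid_utf8(utf8_path))    # True
```

The ISO-8859-1 byte 0xE9 looks like the start of a three-byte UTF-8 sequence, but the next byte is a plain ASCII letter, so any strict UTF-8 decoder has to reject the string.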

Test case

1. Use any I18N WikiWord, e.g. FromageDérapé, from Firefox 1.0 (or IE/Opera configured to not use UTF-8 URLs).

  • You may not even need TWiki installed, so if you have admin access to an Apache 2.0 server, do try this out
2. See the error message and check Apache error.log file

Example Error

When PHP's URL encoding function is used to access files with international characters, a 404 (file not found) is returned. If the URL is first encoded as UTF-8 and then URL-encoded, it works fine. Note the encoding difference in the URLs: R%EAve versus R%C3%AAve.

Examples from log file:

xxx.xxx.xxx.xxx - - [01/Jan/2005:18:23:03 +0100] "GET /Idir/Deux%20Rives%2C%20un%20R%EAve/01%20-%20Pourquoi%20cette%20pluie%20%20.mp3 HTTP/1.0" 404 260 "-" "WinampMPEG/2.9" "-"

xxx.xxx.xxx.xxx - - [01/Jan/2005:21:12:16 +0100] "GET /Idir/Deux%20Rives%2C%20un%20R%C3%AAve/01%20-%20Pourquoi%20cette%20pluie%20%20.mp3 HTTP/1.0" 200 8164000 "-" "WinampMPEG/5.0" "-"

-- FrancisLee - 06 Jan 2005
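
The two encodings in the log lines above can be reproduced with a short Python sketch (an illustration only; the original report used PHP). It percent-encodes the same directory name once as ISO-8859-1 and once as UTF-8, using the standard urllib.parse.quote function:

```python
from urllib.parse import quote

# Directory name from the log entries above.
name = "Deux Rives, un Rêve"

# Percent-encode the ISO-8859-1 bytes (the form that returned 404).
legacy = quote(name, encoding="iso-8859-1")

# Percent-encode the UTF-8 bytes (the form that returned 200).
utf8 = quote(name, encoding="utf-8")

print(legacy)  # Deux%20Rives%2C%20un%20R%EAve
print(utf8)    # Deux%20Rives%2C%20un%20R%C3%AAve
```

The only difference is the single character ê: one byte (%EA) in ISO-8859-1, two bytes (%C3%AA) in UTF-8.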

Environment

TWiki version: TWikiRelease02Sep2004
TWiki plugins: DefaultPlugin, EmptyPlugin, InterwikiPlugin
Server OS: Windows XP
Web server: Apache 2.0.52 (XAMPP for Windows 1.4.9)
Perl version: ActivePerl 5.8.4 (XAMPP for Windows 1.4.9)
Client OS: Windows XP SP2
Web Browser: Firefox 1.0, Mozilla (most versions), Gecko-based browsers, etc

-- RichardDonkin - 09 Dec 2004

Fix

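Install Apache 2.0.54 for Windows, which includes the fix for this bug (ApacheBug:32730). One remaining PATH_INFO problem has been reported with 2.0.54; see the follow-up comments below.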

Follow up and Workarounds

Various possible workarounds are:

  • Stay with Apache 1.3 on Windows (recommended in WindowsInstallCookbook and TWikiSystemRequirements)
  • Set browsers to use only UTF-8 URLs (Firefox, Mozilla, etc)
    • For Firefox, go to about:config in the URL bar, filter by utf, then double-click network.standard-url.encode-utf8 to set it to true (or just edit user.js). Mozilla should be similar. NOTE: Firefox 1.0 has a bug that stops this setting taking effect after a restart, so you may need to reset it in every browser session - MozillaBug:261934 (UPDATED: fixed in Firefox 1.0.1)

UTF-8 URLs do work fine at least for page views and don't produce the same internal server error or message in the error log.

Thanks to HenningRuch for encouragement to download XAMPP and find this problem. See WebSearchProblemWithCygwinAndXAMPP for a XAMPP issue not related to I18N.

-- RichardDonkin - 10 Dec 2004, 30 April 2005

Here's the reply from Martin Duerst of the W3C, author of mod_fileiri and expert on IRIs (internationalised resource identifiers, which can map into UTF-8 URLs/URIs) - posted with his permission:

... Re using mod_fileiri to work around this bug ...

I haven't tried this out. I originally wrote mod_fileiri for Linux servers, where it's unclear what encoding is used for directory and file names. It's possible to configure mod_fileiri to have filenames in UTF-8, and accept incoming requests in a legacy encoding and redirect to the corresponding UTF-8 file. I made this possible so that existing Web servers could be converted from using a legacy encoding for their URIs to using UTF-8, while old URIs would still work. The conditions for this to work are that only one legacy encoding can be used, that there are no collisions between legacy-encoded filenames and UTF-8 encoded filenames (unless there is a large number of files with some weird names, that's usually not a problem at all), and that the site still works with (usually permanent) redirects.

Looking at the bug description, '500 internal server error' looks scary. A legal (according to the URI spec) URI should not produce an internal server error. Whether the URI is UTF-8 or not is beside the point. If a conversion fails when the server is looking for a file, this should just result in 'document not found'.

My guess is that because mod_fileiri is implemented to run in the 'fixup' phase (if the file is found otherwise, no need to use the module), an earlier '500 internal server error' will not allow it to come into action.

In my opinion, Apache on Windows should be fixed to return 'file not found' rather than an internal server error for non-UTF-8 files/directories. Everything else I think is a serious bug. If you file a bug on this, please tell me, and I'll support it. My guess is that if this is fixed, mod_fileiri should work.

From a user viewpoint, using UTF-8 for the whole wiki should solve the problem, and is a good choice for many other reasons.

And of course, Firefox and friends should be fixed to deal with IRIs, too.

Some more about mod_fileiri from a followup message from Martin:

mod_fileiri can do this too, indeed it's what it was originally designed for. It can also have the files in a legacy character encoding, and get requests in that character encoding, but reply with a permanent redirect to the UTF-8 version of that filename, and then reply with the actual document to a UTF-8 version. I.e. you can pretend you already switched to UTF-8 even if you haven't done so. Implementation of this working mode was quite tricky, to avoid loops and other confusions :-)

Clearly, if you can install mod_fileiri on your Apache server, it's a good solution to the URL character encoding issue for all web applications, though probably not necessary for TWiki.

-- RichardDonkin - 13 Dec 2004

I've now logged this as ApacheBug:32730 - you can monitor or support this by signing up to Apache Bugzilla, but please read their bug writing guidelines first and be polite :-)

Note that MozillaBug:261934 makes it painful to work around this by setting Firefox 1.0 to UTF-8 encode all URLs - although the setting in about:config is persistent, it is ignored on startup, so you have to set it to false again, then true.

-- RichardDonkin - 16 Dec 2004

Fix record

I've looked at the Apache 2.0 code and commented on ApacheBug:32730 - the real fix was to stop Apache converting certain environment variables to Unicode.

The fix was to add PATH_INFO to the conditional added for ApacheBug:9223 at line 529 in mod_win32.c (SVN). Details on ApacheBug:32730.

-- RichardDonkin - 20 Dec 2004

FrancisLee had this problem with a non-TWiki application using Apache 2.0, see Example Error above. If anyone else has this problem, whether using TWiki or not, please comment here and on ApacheBug:32730 (and vote on the latter!)

-- RichardDonkin - 06 Jan 2005

Good news - Will Rowe of the Apache team has accepted the patch and has said he'll commit it to Apache 2.0.53-dev and 2.1-dev. See ApacheBug:32730 for his comments.

-- RichardDonkin - 07 Jan 2005

My first patch didn't quite work, see the bug report page for a revised patch that should fix this.

-- RichardDonkin - 10 Jan 2005

My new patch has now been applied to the Apache 2.1 code. Please vote on ApacheBug:32730 to get this patch applied to 2.0!

-- RichardDonkin - 10 Feb 2005

It appears from the Apache SVN repository that ApacheBug:32730 was fixed in the 2.0.x branch (SVN r153677) - so Apache 2.0.54 should include this fix. For anyone who needs a fix before then, apply the latest patch from the Apache bug report page.

To check out the Apache HTTPD 2.0.x branch, use:

svn co http://svn.apache.org/repos/asf/httpd/httpd/branches/2.0.x httpd-2.0.x

-- RichardDonkin - 31 Mar 2005

Apache 2.0.54 is now released, including the fix for ApacheBug:32730 - this Apache release is recommended for anyone using TWiki I18N on Windows. From the Apache changelog for 2.0.54:

  • mod_win32: Ignore both PATH_INFO as well as PATH_TRANSLATED to avoid hiccups from additional path information passed in non-utf-8 format.
    [Richard Donkin <rd9 at donkin.org>]

-- RichardDonkin - 28 Apr 2005

I still have the problem with Win32 Apache 2.0.54 and ActivePerl 5.8.4. cgi-bin/printenv.pl shows that the environment variable PATH_INFO is already garbled when the Perl code starts.

Maybe prep_string() in mod_win32.c doesn't work with my default code page.

Is there a way to get around PATH_INFO, e.g. extracting path info from REQUEST_URI?

-- KaoruMaeda - 01 Jun 2005

I added this to setlib.cfg and it now seems to work.

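   # -------------- Only needed to work around an Apache 2.0 bug on Win32
   # Rebuild PATH_INFO from REQUEST_URI when the request contains % escapes,
   # since Apache may already have garbled PATH_INFO.
   if (defined($ENV{'PATH_INFO'}) &&
       defined($ENV{'REQUEST_URI'}) &&
       defined($ENV{'SCRIPT_NAME'}) &&
       $ENV{'REQUEST_URI'} =~ /\%/) {
      my $scr = $ENV{'SCRIPT_NAME'};
      my $path = $ENV{'REQUEST_URI'};
      # Strip the script name prefix; leave PATH_INFO alone if it does not match
      if ($path =~ s/^\Q$scr\E//) {
         $path =~ s/\?.*//;   # drop the query string
         # Decode %XX escapes (hex digits only) back to raw bytes
         $path =~ s/\%([0-9a-fA-F]{2})/chr(hex($1))/ge;
         $ENV{'PATH_INFO'} = $path;
      }
   }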

I also changed the Encode::encode call in TWiki.pm: &FB_PERLQQ causes an error and should be written as FB_PERLQQ()

-- KaoruMaeda - 14 Jun 2005
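
For readers who want to see what the setlib.cfg workaround above computes, here is a rough Python sketch of the same steps (strip the script name, drop the query string, decode the % escapes). It is an illustration, not TWiki code: the function name, the /twiki/bin/view script path and the skin=print query string are assumptions made up for the example.

```python
from urllib.parse import unquote_to_bytes


def path_info_from_request_uri(request_uri: str, script_name: str):
    """Recover the raw PATH_INFO bytes from REQUEST_URI.

    Returns None when the workaround does not apply (no % escapes,
    or REQUEST_URI does not start with the script name).
    """
    if "%" not in request_uri or not request_uri.startswith(script_name):
        return None
    path = request_uri[len(script_name):]  # strip the script name prefix
    path = path.split("?", 1)[0]           # drop the query string
    return unquote_to_bytes(path)          # decode %XX escapes to raw bytes


# Hypothetical request for the test-case topic, ISO-8859-1 encoded.
raw = path_info_from_request_uri(
    "/twiki/bin/view/Main/FromageD%E9rap%E9?skin=print",
    "/twiki/bin/view",
)
print(raw)                       # b'/Main/FromageD\xe9rap\xe9'
print(raw.decode("iso-8859-1"))  # /Main/FromageDérapé
```

The result is the raw byte string the browser sent, in whatever encoding it used, which is what the script would have received in PATH_INFO had Apache not tried to convert it.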

Not sure why this is not working for you - unfortunately I was not able to test my patch because Apache for Windows only builds with Visual Studio tools that I don't have. The Apache patch appears to be partially working, since the TWiki scripts actually run - previously they were prevented from running with an Internal Server Error (500).

I haven't got any time at present to look at this, having just moved house and started a new job, but if you could attach details of your testenv HTML output, relevant Apache log file entries, your Apache version and setup including default code page, and so on, that may help someone else to figure this out.

From looking at testenv output, specifically for its PATH_INFO test, the following environment variables should perhaps also not be converted to UCS-2 (though TWiki may not require all of these):

REDIRECT_URL (as mentioned above)
SCRIPT_URI
SCRIPT_URL

-- RichardDonkin - 17 Jun 2005

Re Kaoru's issue, this needs some more investigation, but at least Apache 2.0.54 or higher lets the TWiki code run, so I think this problem can be considered mostly resolved, with patch available. Updating InternationalisationIssues.

I've commented on a related bug, ApacheBug:34985, and have exchanged email with an Apache developer who works on the Windows version, but I don't think this bug affects TWiki users.

-- RichardDonkin - 12 Nov 2006

Topic revision: r24 - 2008-09-02 - TWikiJanitor
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.