
Feature Proposal: INCLUDE should not corrupt characters when reading from a web site that uses a different charset


There are cases where your TWiki site uses UTF-8 but you need to include a page from an ISO-8859-1 web site, or vice versa.

Description and Documentation

TWiki::_includeUrl() should perform charset conversion when needed. Currently it does not, so characters may be corrupted. The conversion code would be something like:

$text = Encode::encode($TWiki::cfg{Site}{CharSet}, Encode::decode($charsetOfIncludedText, $text));
The actual code would need logic to determine the site charset reliably and logic to determine the charset of the included text.
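
As a rough illustration of the two pieces of logic mentioned above, the following sketch looks for a charset first in the HTTP Content-Type header and then in the page text, and converts only when it differs from the site charset. The helper names detect_charset and convert_to_site_charset are hypothetical and not part of TWiki, and the site charset is passed in as a parameter instead of being read from the configuration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: find a charset name in the Content-Type header,
# falling back to a charset=... declaration inside the page text.
sub detect_charset {
    my ( $contentType, $text ) = @_;
    for my $source ( $contentType, $text ) {
        next unless defined $source;
        return $1 if $source =~ /charset\s*=\s*["']?([\w.:-]+)/i;
    }
    return;    # charset unknown
}

# Hypothetical helper: re-encode $text into the site charset when needed.
sub convert_to_site_charset {
    my ( $text, $contentType, $siteCharset ) = @_;
    my $from = detect_charset( $contentType, $text );
    return $text unless defined $from;

    # Resolve both names through Encode so that aliases such as
    # "latin1" and "ISO-8859-1" are recognised as the same encoding.
    my $fromEnc = Encode::find_encoding($from);
    my $toEnc   = Encode::find_encoding($siteCharset);
    return $text unless $fromEnc && $toEnc;          # unsupported name
    return $text if $fromEnc->name eq $toEnc->name;  # nothing to do

    return Encode::encode( $toEnc->name,
        Encode::decode( $fromEnc->name, $text ) );
}
```

When no charset can be found, or Encode does not know the name, the sketch returns the text unchanged; whether a real implementation should instead fall back to a default is left open here.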

As you may have noticed, this is not really a "character set conversion"; it is merely an encoding conversion. Its purpose is to avoid character corruption caused by "charset" differences.

Let's say:

  • You have a TWiki site employing UTF-8 as its site charset.
  • You have a web in Japanese on the TWiki site.
  • In a topic of that web, you include an external web site, which happens to use a non-UTF-8 charset.
This situation is not unlikely, because a good number of web sites on the internet use the Shift_JIS and EUC-JP charsets.

This shows that even when you are dealing with a single character set (here, the Japanese character set), you need charset conversion (strictly speaking, merely a character encoding conversion) for the included text to appear as expected in the TWiki topic.
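
The scenario above can be tried out with Perl's core Encode module. This is only a small demonstration, not TWiki code: it encodes the same two Japanese characters in three charsets, then converts Shift_JIS bytes for a UTF-8 site. The characters are written as Unicode code points so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode ();

# The two characters U+65E5 and U+672C (the word for "Japan").
my $chars = "\x{65E5}\x{672C}";

# One character set, three different byte sequences.
for my $charset (qw( UTF-8 Shift_JIS EUC-JP )) {
    my $bytes = Encode::encode( $charset, $chars );
    printf "%-9s %s\n", $charset, uc unpack( 'H*', $bytes );
}

# What INCLUDE would have to do for a Shift_JIS page on a UTF-8 site.
my $included = Encode::encode( 'Shift_JIS', $chars );
my $forSite  = Encode::encode( 'UTF-8', Encode::decode( 'Shift_JIS', $included ) );
```

The byte sequences are E6 97 A5 E6 9C AC in UTF-8, 93 FA 96 7B in Shift_JIS, and C6 FC CB DC in EUC-JP. Displaying the Shift_JIS bytes unconverted on a UTF-8 page is what produces the corruption.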




-- Contributors: HideyoImazu - 2012-08-28

Care must be taken not to load additional modules unnecessarily.
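
One way to avoid the extra load, sketched here under the assumption that the conversion lives in a small helper (convert_if_needed is a hypothetical name, not an existing TWiki function), is to pull in Encode with a run-time require only when the charsets really differ:

```perl
use strict;
use warnings;

# Hypothetical helper: convert $text from charset $from to charset $to,
# loading Encode only when a conversion is actually required.
sub convert_if_needed {
    my ( $text, $from, $to ) = @_;
    return $text if !$from || lc($from) eq lc($to);

    require Encode;    # loaded at run time, on first use only
    return Encode::encode( $to, Encode::decode( $from, $text ) );
}
```

Pages included from sites that use the same charset as the TWiki site then never trigger the module load.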


I think this brings up a bigger question: Should different char sets be supported in a TWiki site? If so, on a page basis? That would require adding char set info to the topic meta data.

I am not sure it makes sense to support char sets on a per page basis in a live TWiki installation. However, this issue comes up when consolidating TWikis from different installations (such as when a company gets acquired) or when data is restored from backup. For these cases it would be useful to have the char set info in the topic meta data. I listed this as a possible enhancement of the BackupRestorePlugin to make restore aware of the char set, and translate if needed.

Example of how topic meta data could look with char set info:

%META:TOPICINFO{author="PeterThoeny" date="1345672962" format="1.2" charset="utf-8" version="5"}%
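
A restore or merge tool could read that attribute with something like the following sketch. Note that the charset attribute is only the proposal made here, not an existing part of the TOPICINFO format, and topic_charset is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset recorded in a topic's
# META:TOPICINFO line, or undef when the line or attribute is absent.
sub topic_charset {
    my ($topicText) = @_;
    return undef unless $topicText =~ /^%META:TOPICINFO\{(.*)\}%$/m;
    my $attrs = $1;
    return $attrs =~ /\bcharset="([^"]*)"/ ? $1 : undef;
}
```

A topic without the attribute yields undef, in which case the tool would have to fall back to the charset of the installation the topic came from.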

-- PeterThoeny - 2012-08-29

This is not quite about character sets. This is actually about character encoding.

The standards, specifications, notions, and terminology that originated from or were influenced by the MIME RFCs are confusing. What charset=XXXX actually specifies is a combination of a character set and an encoding. For example, putting minor nuances aside, charset=Shift_JIS and charset=EUC-JP are for the same Japanese character set. They represent a character code point with different byte combinations. You can convert one to the other arithmetically without using a big table.
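
To illustrate the arithmetic, here is a sketch of a table-free Shift_JIS to EUC-JP conversion using the commonly described byte arithmetic. It covers ASCII, half-width katakana, and JIS X 0208 double-byte characters only; the "minor nuances" (vendor extensions, the yen sign and overline at 0x5C and 0x7E, invalid bytes) are ignored, and sjis_to_eucjp is a name chosen for this example.

```perl
use strict;
use warnings;

# Convert Shift_JIS bytes to EUC-JP bytes by arithmetic only.
sub sjis_to_eucjp {
    my ($sjis) = @_;
    my @in = unpack 'C*', $sjis;
    my @out;
    while (@in) {
        my $s1 = shift @in;
        if ( $s1 < 0x80 ) {                       # ASCII: unchanged
            push @out, $s1;
        }
        elsif ( $s1 >= 0xA1 && $s1 <= 0xDF ) {    # half-width katakana
            push @out, 0x8E, $s1;                 # EUC-JP prefixes it with 0x8E
        }
        else {                                    # double-byte character
            my $s2 = shift @in;
            last unless defined $s2;              # truncated input: stop

            # Each Shift_JIS lead byte packs two JIS rows; trail bytes
            # below 0x9F belong to the first (odd-numbered) row.
            my $first   = $s2 < 0x9F ? 1 : 0;
            my $rowBase = $s1 < 0xA0 ? 0x70 : 0xB0;
            my $cellOff = $first ? ( $s2 > 0x7F ? 0x20 : 0x1F ) : 0x7E;
            my $j1 = ( ( $s1 - $rowBase ) << 1 ) - $first;    # JIS row byte
            my $j2 = $s2 - $cellOff;                          # JIS cell byte

            # EUC-JP is the JIS code with the high bit set on both bytes.
            push @out, $j1 | 0x80, $j2 | 0x80;
        }
    }
    return pack 'C*', @out;
}
```

In practice Encode does this for you; the point is only that both encodings are simple transformations of the same JIS row and cell numbers.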

-- HideyoImazu - 2012-08-29

Accepted by JerusalemReleaseMeeting2012x08x31.

-- PeterThoeny - 2012-08-31

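I just fixed this bug for myself. We are running TWikiRelease04x01x02 and I used the Encode package (because UTF82SiteCharSet did not work for me). I thought this patch might be helpful. Note that the diff below runs from my patched file to the original 4.1.2 file, so the lines marked with "-" are the ones I added.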

--- TWiki.pm	2012-10-06 16:22:13.454392097 +0200
+++ TWiki-4.1.2/lib/TWiki.pm	2007-03-03 15:45:57.000000000 +0100
@@ -44,7 +44,6 @@
 use strict;
 use Assert;
 use Error qw( :try );
-use Encode;
 require 5.005;       # For regex objects and internationalisation
@@ -1714,34 +1713,6 @@
         if( $httpHeader =~ /content\-type\:\s*([^\n]*)/ois ) {
             $contentType = $1;
-        # <charset-patch>
-        # Note.  $text = $this->UTF82SiteCharSet( $text );
-        #        fails. It does not return anything for a UTF8 page.
-        my $inputCharset = '';
-        # retrieve charset from HTTP header
-        if( $contentType =~ /; ?charset=([-\w]+)/ois ) {
-            $inputCharset = $1;
-        }
-        # retrieve charset from <head>
-        if ( !$inputCharset ) {
-            if( $text =~ /; ?charset=([-\w]+)/ois ) {
-                $inputCharset = $1;
-            }
-        }
-        $inputCharset =~ s/utf-8/utf8/g;
-        $inputCharset =~ s/latin([0-9]+)/ISO-8859-$1/g;
-        warn "Converting from ".$inputCharset." to ".$TWiki::cfg{Site}{CharSet};
-        if ( $inputCharset and $inputCharset ne $TWiki::cfg{Site}{CharSet} ) {
-            $text = Encode::encode($TWiki::cfg{Site}{CharSet}, Encode::decode($inputCharset, $text));
-        }
-        # </charset-patch>
         if( $contentType =~ /^text\/html/ ) {
             $path =~ s/[#?].*$//;
             $host = $protocol.'://'.$host;

-- LukasProkop - 2012-10-06

Lukas, this topic is about enhancing the latest version of TWiki. Your issue is related but outside the scope of this topic, because it is neither about the latest version nor about including external content whose charset differs from the TWiki site's charset.

Please go to http://develop.twiki.org/~twiki4/cgi-bin/view/Bugs and report your issue together with the Perl version you are using.

-- HideyoImazu - 2012-10-08

@HideyoImazu: My reason for posting here is that the bug was already reported (Item6942) and, yes, my patch works with an outdated version. Therefore I thought I'd rather post it where Google points to than try to contribute to the issue with such an unbeautiful hack.

Anyway, I am perfectly fine with deleting my posts here if they are not appropriate. Thanks for fixing the formatting :-)

-- LukasProkop - 2012-10-10

Topic revision: r11 - 2012-10-10 - LukasProkop
Copyright © 1999-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.