Internationalisation Enhancements
This page is targeted at anyone interested in the development of internationalisation support for TWiki. If you are looking for instructions on configuring your TWiki to work with your local language, see
InstallationWithI18N
- One code base for the world
- English is just another language
Slogans borrowed from the Mozilla I18N
project
For help in updating plugins or core code for Internationalisation, see
InternationalisationGuidelines.
This page is a gateway and discussion point for developers working on
I18N. It is mainly a collection of resources useful to such
developers.
Related pages: InternationalisationDiscuss,
InternationalisationIssues,
UnicodeSupport,
InternationalisationUTF8,
ProposedUTF8SupportForI18N,
EncodeURLsWithUTF8,
CyrillicSupport,
JapaneseAndChineseSupport,
UserInterfaceInternationalisation,
UserInterfaceLocalisation,
BiDirectionalText
Introduction
The TWiki code, since
TWikiRelease01Feb2003, has good support for internationalisation ('I18N') in some key areas. This is primarily to support 8-bit character sets, such as ISO-8859-15 and KOI8-R, though this also helps today with multi-byte character sets such as EUC-JP, and will help with Unicode in the longer term (see below for Unicode links, and
InternationalisationUTF8 for discussion).
The key feature is support of 8-bit characters in
WikiWords, ensuring that they are auto-linked, displayed and sorted as required. A locale-aware version of
grep will be necessary for searching to work properly - GNU
grep works fine and is available on virtually any platform.
Use of locales is controlled by a
configure setting, and the locale is site-wide for simplicity. More complex setup of locales may be possible in future, but there are security issues with allowing web users to set their own locale variables.
The Move to Unicode
Unicode and
UTF-8
support were out of scope for the initial work in late 2002, because the whole area of Unicode is much more complex. UTF-8 support has been investigated, and implemented for URLs in
EncodeURLsWithUTF8, but full UTF-8 support is currently on hold because of lack of time, some technical difficulties (see
ProposedUTF8SupportForI18N), and the greater importance of
UserInterfaceInternationalisation. However, East Asian sites successfully run TWiki with UTF-8 for Japanese, Chinese, Korean, etc. It is to be hoped that
I18N may come back into scope in 2008.
Things to do
Multi-byte character support
A few multi-byte character encodings other than Unicode do work already, specifically EUC-JP, EUC-KR, EUC-TW and EUC-CN. All other character encodings, including Shift-JIS, Big5, GB2312, GBK, UHC, Johab and others, will
not work due to their not being 'ASCII safe' - some East Asian characters include ASCII bytes, usually as the second byte, that can be confused with special TWiki characters.
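The 'ASCII safe' distinction can be checked directly. The following is a minimal Python sketch, not TWiki code: the function name is made up for illustration, and it relies only on the codecs shipped with Python's standard library. It reports which ASCII characters turn up as bytes inside the encoded form of a multi-byte character.

```python
def ascii_bytes_in_multibyte_chars(text, encoding):
    """Return the ASCII characters whose byte values occur inside the
    encoded form of any non-ASCII character of `text`.

    Hypothetical helper for illustration only (not part of TWiki).
    """
    clashes = set()
    for char in text:
        if ord(char) < 0x80:
            continue  # only multi-byte characters are of interest here
        for byte in char.encode(encoding):
            if byte < 0x80:
                clashes.add(chr(byte))
    return clashes


# U+30BD (katakana "so") and U+8A31 are commonly cited problem characters:
# in Shift-JIS and Big5 respectively, their second byte is 0x5C (backslash).
for encoding in ("shift_jis", "euc_jp", "utf-8"):
    print(encoding, ascii_bytes_in_multibyte_chars("\u30bd", encoding))
for encoding in ("big5", "utf-8"):
    print(encoding, ascii_bytes_in_multibyte_chars("\u8a31", encoding))
```

An empty result for a sample text does not prove an encoding is safe, since it only covers the characters tested, but a non-empty result shows the kind of clash described above.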
Quite a few bugs relating to use of Chinese and similar languages with TWiki have been fixed - see
InternationalisationIssues.
UTF-8 support:
ProposedUTF8SupportForI18N is under development and UTF-8 can be used in a limited way today as the site character set for East Asian sites, i.e. Chinese, Japanese, Korean and Vietnamese, without
WikiWord support beyond ASCII (which is not so important anyway for many East Asian languages).
UserInterfaceInternationalisation describes the work done on a framework to enable localisation of the TWiki user interface (i.e. translating English language text, wherever it appears in templates or other parts of the TWiki user interface, into another language). Through use of
CPAN modules for
L10N, it should be quite easy to translate TWiki into another language - see
UserInterfaceLocalisation for details.
Multilingual TWiki
Enabling multiple languages within a single TWiki page (e.g. French, Russian and Japanese) - this should be implemented through Unicode (specifically
UTF-8 support), combined with the
UserInterfaceInternationalisation already implemented in
DakarRelease (which allows each user to select their preferred translation of the TWiki user interface).
Specific requests:
Older TWiki releases
Browser setup
TWikiRelease01Sep2004 and later do
not require any browser setup. If you have an older TWiki version the following may help.
- For users of Firefox, Mozilla/SeaMonkey and similar browsers: you can optionally configure your browser for UTF-8 URLs as follows so that non-ASCII characters display properly in the URL bar and when mousing over links. This is not necessary for TWiki to work, but looks nicer. Instructions given are for Firefox 1.0:
- Type
about:config into the URL bar
- Type
utf into the filter field that appears
- Double-click on the
network.standard-url.encode-utf8 line so that it says true
- Double-click on the
network.standard-url.escape-utf8 line so that it says true
- This ensures that UTF-8 URL encoding is used for all URLs - note that this does not mean your site needs to use UTF-8. See EncodeURLsWithUTF8 for more details.
- History:
- TWiki's Dec 2001 release could link to WikiWords with 8-bit characters in their names, as long as you use
[[WikiWord]] type links.
- TWikiRelease01Feb2003 was the first release with full I18N support for 8-bit WikiWords.
- In both these releases, you needed to disable UTF-8 (Unicode) encoding of URLs by the browser (which is enabled by default in some browsers):
- InternetExplorer 5.0 or higher: in Tools | Options | Advanced, uncheck 'always send URLs as UTF-8', then close all IE windows and restart IE. (No changes needed for IE 4.0)
- OperaBrowser 6.x or higher: in Preferences | Network | International Web Addresses, uncheck 'encode all addresses with UTF-8'.
- MozillaBrowser 1.x, Netscape 6+ and related browsers: no setup necessary (tested on Mozilla 1.1 and K-Meleon 0.7) - but see the skin/template changes below
- Netscape 4.x: works fine in general without any setup changes, but may have some problems as in BrowserProblemWithUmlauts (tested briefly on Netscape 4.7) and may 'burp' on loading page
- Lynx 2.8: no setup necessary (tested briefly on Lynx 2.8.4rel.1 with CygWin)
- Once you've done this, the following should link to an existing page called LaLangueFrançaise, with the topic name appearing correctly: LaLangueFrançaise - written using
[[Sandbox.LaLangueFrançaise][LaLangueFrançaise]]
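To see why the browser's URL encoding matters for a site using an 8-bit character set, here is a small Python sketch (an illustration only, not TWiki code) that percent-encodes the example topic name both ways. The variable names are made up; `quote` and `unquote` are from Python's standard `urllib.parse` module.

```python
from urllib.parse import quote, unquote

# The example topic name; \u00e7 is the c-cedilla in "Française".
topic = "LaLangueFran\u00e7aise"

# What a browser sends when it encodes URLs as UTF-8 ...
utf8_url = quote(topic, encoding="utf-8")
# ... and what it sends when it uses the site character set (ISO-8859-1 here).
latin1_url = quote(topic, encoding="iso-8859-1")

print(utf8_url)    # two escaped bytes for the c-cedilla
print(latin1_url)  # one escaped byte for the c-cedilla

# Decoding with the wrong assumption does not give back the topic name,
# which is why older releases needed UTF-8 URL encoding turned off.
print(unquote(utf8_url, encoding="iso-8859-1"))
print(unquote(latin1_url, encoding="utf-8", errors="replace"))
```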
Skin and template changes
In
TWikiRelease01Feb2003, some minor skin/template changes are needed to support use of forms with
Mozilla and the
%CHARSET% variable with any browser. The standard TWiki templates are now fixed in the
TWikiAlphaRelease to work with
I18N web names and
WikiWords using Mozilla - this is because Mozilla decides to UTF8-encode URLs if they are used as a form submission URL, even though the whole page is in ISO-8859-1 mode and other URLs are never encoded...
To make any skin work with the new
I18N support, some simple changes are needed to any form submission URLs:
- Locate any
<form> elements in your skin templates - e.g. grep -i '<form' *.tmpl under Unix/Linux/CygWin.
- Change the form submission URL (usually on same line as the
<form tag, and always part of the action="http://foo" attribute) so that the variables %WEB%, %BASEWEB%, %INCLUDINGWEB% and %TOPIC% are properly URL encoded. For example, to URL encode .../%WEB%/%TOPIC% write .../%INTURLENCODE{"%WEB%/%TOPIC%"}%.
- You only need to make this change for form submission URLs - any other URLs don't need to change, e.g. those used for normal links
- Be sure to use %INTURLENCODE% for this, rather than any other URL-encoding variable - this helps to ensure that your skin will work smoothly in the future, when TWiki eventually supports UTF8 throughout.
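As a rough aid for the steps above, the following Python sketch scans template text for <form> tags that still contain %WEB%, %BASEWEB%, %INCLUDINGWEB% or %TOPIC% outside an %INTURLENCODE{"..."}% wrapper. It is an assumption-laden illustration rather than a TWiki tool: the function name and the sample template lines are hypothetical, and the regular expressions only cover the simple layout described above.

```python
import re

# A <form ...> opening tag, matched case-insensitively.
FORM_TAG = re.compile(r"<form\b[^>]*>", re.IGNORECASE)
# An already-encoded section such as %INTURLENCODE{"%WEB%/%TOPIC%"}%.
ENCODED = re.compile(r'%INTURLENCODE\{"[^"]*"\}%')
# The variables that need URL encoding in form submission URLs.
RAW_VARS = re.compile(r"%(WEB|BASEWEB|INCLUDINGWEB|TOPIC)%")


def find_unencoded_form_vars(template_text):
    """Return (tag, variable_names) pairs for form tags that still
    use one of the variables without %INTURLENCODE{...}%."""
    problems = []
    for match in FORM_TAG.finditer(template_text):
        tag = match.group(0)
        # Ignore anything that is already wrapped, then look for leftovers.
        leftover = ENCODED.sub("", tag)
        raw = RAW_VARS.findall(leftover)
        if raw:
            problems.append((tag, raw))
    return problems


# Hypothetical template lines, for illustration only.
old_style = '<form action="%SCRIPTURLPATH%/save/%WEB%/%TOPIC%">'
new_style = '<form action="%SCRIPTURLPATH%/save/%INTURLENCODE{"%WEB%/%TOPIC%"}%">'
print(find_unencoded_form_vars(old_style))
print(find_unencoded_form_vars(new_style))
```

Only form tags are examined, in line with the note that normal links do not need to change.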
To support character set selection (which enables any character set to be used in the skin and topic contents), skins should use the following
HTML using the new
%CHARSET% variable instead of
iso-8859-1:
<head>
...
<meta http-equiv="Content-Type" content="text/html; charset=%CHARSET%" />
</head>
Skins for
TWikiSyndication should use names of the form 'rss*' - this ensures that the TWiki code knows it is handling RSS data, which requires
I18N characters (i.e. with 8th bit set) to be encoded as
&#nnn; sequences.
- Entities written as numeric character references (NCRs) such as
&#1562; are drawn from the Unicode (ISO 10646-1) character set, whose first 256 codepoints are the same as ISO-8859-1. These entities always refer to the same character, regardless of the document's character encoding, according to the HTML 4.0 spec.
- See Plugins.InternationalisingYourSkin for more discussion of how to internationalise skins.
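The encoding of 8-bit characters as numeric character references can be illustrated with a short Python sketch. This is not how TWiki implements it; `to_ncr` is a hypothetical helper built on Python's standard `xmlcharrefreplace` error handler, and `html.unescape` is used only to show that the reference maps back to the same character.

```python
import html


def to_ncr(text):
    """Replace every non-ASCII character with a decimal numeric
    character reference; ASCII text is left unchanged.

    Hypothetical helper for illustration only.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# \u00e7 is the c-cedilla, codepoint 231 in both ISO-8859-1 and Unicode.
encoded = to_ncr("LaLangueFran\u00e7aise")
print(encoded)

# The reference names a Unicode codepoint, so it decodes to the same
# character whatever encoding the surrounding document uses.
print(html.unescape(encoded))
```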
History - I18N-related TWiki pages
8-bit Wiki words etc
8-bit Interwiki
8-bit external programs
I18N of search results
Selecting browser character set
I18N resources
TWiki sites
These sites are either about
I18N or using TWiki
I18N features - some old sites using
I18N may require UTF-8 URL encoding to be turned off in your browser as per
#BrowserSetup, but those using
TWikiRelease01Sep2004 or later will not:
Character encodings for internationalisation
Pre-Unicode character encodings
- PhpWiki:ISO-8859-1
- PhpWiki:ISO-8859-15
- KOI8-R - used in Russia
- CPAN:Encode::Supported has a very complete list of 8-bit encodings, including vendor variations
- Also includes the RFC:2047 MIME header quoted-printable and Base64 encodings used in emails etc, e.g. =?iso-8859-1?q?=20Smith?=
- libiconv NOTES file - good overview of the character codings actually used world-wide
- Great articles on Japanese encodings, including mention of the ISO-2022-JP characters that clash with HTML pages and TWikiML syntax:
- very good articles on characters and encodings, particularly ISO-8859-* and Windows-1253, with some Unicode articles as well.
- Chinese XML FAQ covers some issues - in particular, Q18 states that Big5 is not safe as a site character encoding, for the same reasons as ISO-2022-* and HZ-*
Unicode and UTF-8
Scripts and languages
- Ancient Scripts - good coverage of non-Roman writing systems, many of which are still used today despite the name of the site
- Language Introductions - tutorials on writing in Russian, Korean, etc.
FAQs and Guides
Other Wiki i18n efforts
Many other leading Wikis already have i18n features:
Useful newsgroup threads
Updates
One issue is that many locale setups are somewhat broken, particularly on Windows. On Debian GNU/Linux, the
\w regex in the
fr_FR.ISO8859-1 locale matches '-' as well as '_', which is a minor issue, while on
CygWin there is no locale support at all, and on
ActivePerl, uppercasing a character can lead to a completely different and even non-alphabetic character! In Perl 5.8 on another Debian system, using the locale
fr_FR.UTF8 meant that the collation order was as for ASCII, and a Japanese (Kanji) character was included in the set of alphabetic characters...
This means that workarounds will be essential for many people, so this code will make it easy to avoid using any locale functions if
$useLocale is turned off - basically, this will involve typing a list of upper and lower case non-ASCII national characters into
TWiki.cfg variable settings. This will help with features handled entirely by TWiki, such as
WikiWords, but won't address external programs, for which the only solution is to report the bugs to whoever maintains them, or perhaps install different versions of such programs.
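The no-locale workaround amounts to building the WikiWord character classes from explicit lists of national characters instead of asking the locale. The Python sketch below shows the idea only: the function names are made up, the pattern is a simplified WikiWord shape (upper, lower or digit, upper, then anything alphanumeric) and is not copied from TWiki's source, and the character lists passed in stand in for whatever an administrator would type into the configuration.

```python
import re


def wikiword_pattern(upper_national="", lower_national=""):
    """Build a simplified WikiWord pattern from explicit character lists.

    Illustrative only: the real TWiki regexes are more involved.
    """
    upper = "A-Z" + re.escape(upper_national)
    lower = "a-z" + re.escape(lower_national)
    return re.compile(
        f"[{upper}]+[{lower}0-9]+[{upper}]+[{upper}{lower}0-9]*"
    )


def is_wikiword(name, pattern):
    """True if the whole name matches the WikiWord pattern."""
    return pattern.fullmatch(name) is not None


ascii_only = wikiword_pattern()
# \u00c7 and \u00e7 are upper and lower case c-cedilla, listed by hand.
with_french = wikiword_pattern(upper_national="\u00c7", lower_national="\u00e7")

name = "LaLangueFran\u00e7aise"
print(is_wikiword(name, ascii_only))
print(is_wikiword(name, with_french))
```

Because the lists are typed in by hand, nothing here depends on the platform's locale data, which is the point of the workaround; the cost is that every national character in use has to be listed.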
UPDATE: I've coded most of this - all the basic link types are working, apart from anchors and upper casing in spaced-out
WikiWords. There's a test page up at
http://donkin.org/bin/view/Test/TestTopic5
running on this code - not yet in
TWikiAlphaRelease as I'd like to test it a bit more, but it seems to work OK. It's been tested in no-locale mode only so far, so will work on broken locales. I really need Perl 5.6 on a system with working locales to test this - will probably have to install Perl 5.6 on Debian.
--
RichardDonkin - 26 Nov 2002
I've now got sorting of the
WikiWords in
WebIndex working - turns out that
ls on my Debian is locale-unaware, but TWiki sorts the output anyway in Perl, so it works with only a five line change to Search.pm. Locales are also working fine under Perl 5.005_03.
--
RichardDonkin - 29 Nov 2002
Now in
TWikiAlphaRelease - please test this out and log any bugs! It's quite easy to set up if you have a working locale on your system. Be sure to review
#Browser_setup for a simple browser config change required for this to work.
--
RichardDonkin - 30 Nov 2002
More links about what other Wikis are doing in this area -
PhpWiki is quite a way ahead, in that it actually ships with translated pages for several languages and already supports
PhpWiki:DoubleByteCharacters.
MoinMoin also ships with translated pages and has Unicode character support.
--
RichardDonkin - 02 Dec 2002
Now released as part of
TWikiRelease01Feb2003 and running on TWiki.org (with
I18N turned off).
(Discussion refactored to InternationalisationDiscuss; any bugs should be reported via BugReports as normal, and linked from InternationalisationIssues as well.)
--
RichardDonkin - 16 Feb 2003
Mainframe (EBCDIC) and UTF-8 support
Update on recent work:
--
RichardDonkin - 11 Sep 2003
Localisation (L10N) of TWiki
Added link above about KwikiWiki's L10N - localization Kwiki. See the links at the end - Kwiki uses standard
CPAN modules for that.
What is the difference between
L10N and
I18N?
--
PeterMasiar (cannot copy-paste sig?)
I18N stands for internationalization (I + 18 chars + N).
L10N stands for localization (L + 10 chars + N). Internationalization makes an application ready to be localized into different languages. That is,
I18N is the base, making sure the app can handle character sets in multiple languages and provides a framework handling language specific text and formatting (e.g. externalized language files).
L10N into a different language is a relatively simple task for an app that has a solid
I18N framework.
--
PeterThoeny - 14 Sep 2003
There are some links on
L10N of TWiki in an
earlier section (now added to the TOC). There are some people interested in doing translations of TWiki, which would involve development of the infrastructure to support
L10N - currently, TWiki
I18N is aimed at page editing and display rather than at
L10N of TWiki's text output, but that could change if a Perl developer starts writing some patches for this.
--
RichardDonkin - 14 Sep 2003
Localisation: translation of TWiki documentation
See TranslationSupport for more recent discussion.
UTF-8 support in URLs
Significant progress has been made here, so you can now use UTF-8 URLs with virtually any site character set - see
EncodeURLsWithUTF8. Now in
TWikiAlphaRelease.
--
RichardDonkin - 19 Jan 2004
New for Dakar
- Strikeout of Edit and Attach links in edit and preview pane made language-insensitive
- Made the "Add form..." and "Replace form..." buttons configurable in templates -- TW
Some refactoring of this page to reflect current work on
UserInterfaceInternationalisation being done for
DakarRelease and highlight optional
FirefoxBrowser setup for more readable display of URLs in the URL bar, and remove or de-emphasise historic info.
--
RichardDonkin - 03 Oct 2005
I'm just wondering if there is a possibility to disable language selection through browser identification and just stick to English. Is there a variable to completely disable that stuff?
--
GerdMeison - 04 Nov 2005
yep; {UseInternationalisation}
--
CrawfordCurrie - 04 Nov 2005
I'm sorry, Crawford, but a "grep -R
UseInternationalisation" on my dakar-install doesn't find anything. In which file should that be written? My wiki has a default internationalisation on the user interface. It's only that part which I want to have always in English.
--
GerdMeison - 04 Nov 2005
{UseInternationalisation} is an option in the
configure interface. Check out lib/TWiki.cfg
--
AntonioTerceiro - 04 Nov 2005
No, I mean, it was just renamed to
{UserInterfaceInternationalisation} (in
SVN).
--
AntonioTerceiro - 06 Nov 2005
Does working (i18n) code exist for capitalizing
wiki word to
WikiWord?
--
ArthurClemens - 22 Mar 2006
There's some code in
SVN:TWiki/Render.pm
that looks like this - it will work with
I18N as long as locales are properly set up, but it probably won't work in 'locale regexes off' mode:
# Turn spaced-out names into WikiWords - upper case first letter of
# whole link, and first of each word. TODO: Try to turn this off,
# avoiding spaces being stripped elsewhere
$theTopic =~ s/^(.)/\U$1/;
$theTopic =~ s/\s([$TWiki::regex{mixedAlphaNum}])/\U$1/go;
So this is something of an
I18N bug - requires code that uses
upperNational and
lowerNational to do upper-casing, which is not trivial since some lower case letters don't exist as upper case (e.g. German
ß). Probably not worth fixing unless someone has this issue and the time to fix it.
--
RichardDonkin - 30 Mar 2006
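For comparison, here is a Python sketch of the same idea: upper-casing the first letter of each word of a spaced-out name while leaving alone any letter that has no single-character upper-case form. It is not a port of the Render.pm code above and the function names are hypothetical. It relies on Python's built-in Unicode case mapping rather than upperNational and lowerNational lists, and it simply splits on whitespace, which is looser than the original regexes.

```python
def upper_first(char):
    """Upper-case one character, unless its upper-case form is not a
    single character (German sharp s maps to "SS")."""
    upper = char.upper()
    return upper if len(upper) == 1 else char


def to_wikiword(spaced_name):
    """Turn a spaced-out name into a WikiWord: upper-case the first
    letter of each word and drop the spaces."""
    words = spaced_name.split()
    return "".join(upper_first(word[0]) + word[1:] for word in words)


print(to_wikiword("la langue fran\u00e7aise"))
print(to_wikiword("\u00e9t\u00e9 indien"))   # starts with e-acute
print(to_wikiword("\u00dfa test"))           # starts with sharp s
```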
I installed the twiki
DakarRelease. But I found that the Chinese topic title cannot display correctly. Moreover, it makes the page format wrong. I copied the page
TWikiQickStart on stlchina
from
http://www.stlchina.org
(a Chinese TWiki site). But it does not display the same as it does on stlchina. Please check the attached file for details.
--
ZhengLingxiang - 05 Apr 2006
It's best if you create a new support request under the
Support web. See
SupportGuidelines on how to do this.
Your
raw.txt attachment is quite interesting - it is using either GB2312 or GBK character encoding. Neither of these is supported by TWiki (see
JapaneseAndChineseSupport for details) since there are some Chinese characters that include ASCII characters that are processed (parsed) by TWiki (e.g.
[), which will cause your page text to be displayed incorrectly.
From your
configure.htm output, it seems you are using UTF-8, which explains why pasting in text in GBK didn't work.
--
RichardDonkin - 05 Apr 2006
The text is saved in UTF-8 format in the wiki page. If I just preview the topic when editing, everything works fine. But after I save it, the page cannot be displayed properly. The raw.txt is in GBK just because I saved it in this format.
--
ZhengLingxiang - 05 Apr 2006
I'll need more information to help further - the exact error case you are seeing needs to be clearly explained. I don't read Chinese, so please be very specific as to exactly which characters don't work.
SupportGuidelines is a good place to start.
--
RichardDonkin - 05 Apr 2006
I did some more tests and created a new support page
ChineseHeadlineBrokenPageFormat
--
ZhengLingxiang - 06 Apr 2006
As far as I can see, the site lang doesn't get used. I changed line 141 to:
my $userLanguage = _normalize_language_tag($session->{prefs}->getPreferencesValue('LANGUAGE')) || $TWiki::cfg{Site}{Lang};
Now it will use the site lang if there is no user pref.
--
AdamHyde - 08 May 2008
Correct - and it has been removed.
--
CrawfordCurrie - 31 May 2008
The Lang (more recently Site Lang) was intended for future use when we eventually supported multiple languages, but this was never implemented.
--
RichardDonkin - 14 Jun 2008