
Understanding Encodings

Why do we have such a hard time with international character sets? As more of us start to understand how they work, the implications of support for UTF-8 (and other encodings) are becoming more apparent.

Some or all of the following may be wrong; I'm sure you'll correct it silently by editing the text, and adding your name to the contributors, rather than adding discussion.

InternationalisationEnhancements and InternationalisationUTF8 have some very useful background links on I18N and UTF8 - recommended reading. Please include the most useful links here under Resources.

Primer

Character Sets and Encodings

When talking about character sets, we have to bear in mind that the basic storage element in a file on disk is a byte (8 bits). We have to be able to represent all possible characters in all the world's character sets using bytes alone, so they can be stored in files.

If you put your imagination into gear, you can picture all the characters in all the languages in the world laid out in a big long line. If you then assign an integer code to each character, you can see we have a way of addressing every character with one integer code. This mapping is referred to as Unicode.

The ASCII character set defines the first 128 of these integer codes. Characters from the ASCII character set are referred to as 7-bit characters because they can all be represented using the lower 7 bits of a byte.

When bit 8 of a byte is set, another 128 characters (128..255) are available within a single byte. These are referred to as the high bit characters. In Western European 8-bit character sets, high bit characters are used for accented characters (umlauts, cedillas etc) and some standard symbols. When added to ASCII, this single-byte range covers many Western alphabets.

As soon as we need more characters we have a problem; there is no space left in a byte to represent any more characters. We have to move to a multi-byte encoding. There are several of these, particularly for Asian character sets where the distinction between character set and encoding is important. Here, the one we are most concerned with is UTF-8, which is a character encoding for Unicode. UTF-8 is a variable-width encoding, meaning that single encoded characters can be anything from 1 to 4 bytes in length.

For example, the character with code point 32765 (0x7FFD) is way outside the range that can be represented by ASCII. So the number 32765 is encoded into 3 byte values as: 231 191 189. See the Wikipedia article on UTF-8 for all the gory details.
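
As a minimal Perl sketch of that example, using the core Encode module:

    use Encode qw(encode);

    my $char  = chr(32765);                   # one character, code point 32765 (0x7FFD)
    my $bytes = encode( 'UTF-8', $char );     # the same character as UTF-8 bytes

    printf "%d %d %d\n", map { ord } split //, $bytes;                # prints: 231 191 189
    printf "%d character, %d bytes\n", length $char, length $bytes;   # 1 character, 3 bytes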

Note that correct interpretation of this encoding depends on our knowing in advance that this string of bytes encodes a single character. There is nothing inherent in the string to say whether these bytes represent character 32765, or the three high-bit characters 231, 191 and 189 - or even some other completely different character in a different encoding. Hence the importance of knowing the precise encoding used for any piece of text.

A string of bytes that is perfectly legal in one encoding may represent an illegal character in another encoding, and cause TWiki (perl) to crash (segfault) when a topic containing that string is read.

Browsers

All the protocols used on the web support the specification of the encoding used in the content being transferred. For text-based media types like HTML, HTTP defaults to an encoding of ISO-8859-1 (RFC:2616, section 3.7.1). A different encoding needs to be explicitly specified in the HTTP Content-Type header.

Note that the encoding cannot be reliably specified within the content, because applications (like web browsers) need to know the encoding in advance to understand the content. Most browsers accept a meta element in HTML to specify a nonstandard encoding, but this only works if the encoding agrees with ASCII (and hence with ISO-8859-1) for at least the characters needed to read that element: <, m, e, t, a, and so on.

Languages may have their own defaults for encoding. For example, XML, and by inference XHTML, defines UTF-8 as the default encoding.

In any case, relying on defaults is not very useful, which is why TWiki specifies the charset in the HTTP Content-Type header. This helps ensure that the browser makes the right decision rather than guessing.
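
For illustration, this is roughly how a Perl CGI script declares the charset explicitly rather than relying on the HTTP default. It is a generic CGI.pm sketch, not TWiki's actual code, and the 'iso-8859-1' value just stands in for whatever the site charset is:

    use CGI;

    my $q       = CGI->new;
    my $charset = 'iso-8859-1';    # assumed site charset for this sketch

    # Sends "Content-Type: text/html; charset=iso-8859-1"
    print $q->header( -type => 'text/html', -charset => $charset );
    print "<html><head>...</head><body>...</body></html>\n";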

URLs

URLs are generally submitted in an HTTP request (GET, POST, etc) in one of two formats, both of them URL-encoded. The URL format used is completely dependent on the browser, not the web server or the web page's character encoding:

  • UTF-8 - this is used regardless of the web page's encoding - in current TWiki versions, the EncodeURLsWithUTF8 feature maps the URL back into the site charset. This is the default in InternetExplorer, OperaBrowser, Firefox 3.x onwards, etc.
  • Site charset (e.g. ISO-8859-1, KOI8-R for Russian, etc) - this is the default in older browsers such as Lynx, and also in Firefox 2.0 and earlier.

The EncodeURLsWithUTF8 code must dynamically detect the character encoding of the URL - while detecting character encodings is a hard problem in general, it's not too hard to discriminate between UTF-8 and 'other', as UTF-8 has a distinctive format. Hence this code uses a regular expression that detects well-formed UTF-8, which also helps TWikiSecurity by rejecting 'overlong' UTF-8 encodings.
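
For the curious, a byte-oriented regular expression along the following lines can make that discrimination. This is the widely published W3C pattern for well-formed UTF-8 (it rejects overlong forms and surrogates); it is not necessarily the exact expression TWiki uses, and $url_bytes / $site_charset are placeholder names for this sketch:

    # Matches only strings that are entirely well-formed UTF-8
    my $valid_utf8 = qr/^(?:
         [\x00-\x7F]                        # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte sequences
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # 3-byte, excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte sequences
       |  \xED[\x80-\x9F][\x80-\xBF]        # 3-byte, excluding UTF-16 surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # 4-byte, planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # 4-byte, plane 16
    )*$/x;

    my $site_charset = 'iso-8859-1';      # placeholder for the {Site}{CharSet} setting
    my $url_bytes    = "caf\xC3\xA9";     # example URL path bytes after %-decoding

    my $url_encoding = ( $url_bytes =~ $valid_utf8 ) ? 'utf-8' : $site_charset;   # 'utf-8' here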

In fact, the URL can be received either by TWiki or by the web server software (such as Apache). URLs for TWiki topics are handled by TWiki, but URLs for attachments normally go directly to Apache. It's important that the encoding used for the attachment's pathname matches the encoding of the URL used to view the attachment, otherwise the attachment can't be used. The EncodeURLsWithUTF8 feature has some code to handle this.
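
A rough sketch of the kind of handling this implies for a URL reaching TWiki - illustrative only, not the actual EncodeURLsWithUTF8 code, and reusing the $valid_utf8 pattern and $site_charset placeholder from the previous sketch:

    use URI::Escape qw(uri_unescape);
    use Encode      qw(decode encode);

    my $path_bytes = uri_unescape( $ENV{PATH_INFO} || '' );   # undo %XX escaping

    # If the browser sent UTF-8 but the site uses an 8-bit charset, convert the
    # path so it matches topic and attachment names as stored on disk.
    if (   $path_bytes =~ /[\x80-\xFF]/       # contains high-bit bytes...
        && $path_bytes =~ $valid_utf8         # ...that form well-formed UTF-8
        && lc($site_charset) !~ /^utf-?8$/ )
    {
        my $chars = decode( 'UTF-8', $path_bytes );
        $path_bytes = encode( $site_charset, $chars );
    }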

Forms and JavaScript

Internally, most browsers represent the characters being displayed using Unicode. When a user types into an input field, those characters are captured as Unicode characters.

All JavaScript operations on characters assume Unicode, since the JavaScript language uses Unicode internally.

When displaying the web page, the browser will either:

  • use the encoding declared by the server - via the HTTP Content-Type header, HTML meta tags, and the HTML form tag (which may carry an accept-charset attribute) - with heuristic guessing as a fallback where no declaration is present, or
  • use an encoding explicitly configured by the user (generally when the heuristics or the declaration are wrong).

When an HTML form is submitted, the data for that form is generally submitted using the encoding specified by the server when generating the web page. However, there are no firm standards in this area (see InternationalisationUTF8 for links), so in theory the browser could do whatever it wants. In practice, the browser normally converts its internal Unicode characters to the web page's encoding before sending them as POST or GET data to the web server. Since a POST or GET includes a URL, that part is subject to the way that URLs are encoded, which may differ from the form data encoding - see the #URLs section above.

For example, a € (Euro) symbol is represented by code point 0x80 (128) in the windows-1252 character set (plain ISO-8859-1 has no Euro at all, and ISO-8859-15 puts it at 0xA4). Let's say we have a page containing a form, served in windows-1252 (which browsers also use in practice when a page claims to be ISO-8859-1). The page contains the character code 0x80 in the default value for the 'Currency' input field in the form. When the page is loaded, the browser maps this character to the Unicode code point 0x20AC, which is the Unicode position for the Euro. If the user types a Euro character, it is the Unicode character that is captured in the form. However, when the form is submitted, the browser silently converts 0x20AC (Unicode) back to 0x80 before sending the data to the server. Thus a server using forms can assume that if it imposes a character encoding on a web page, then form data returned to it will also use that encoding.
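
On the server side, this means form parameters arrive as bytes in the page's charset, and code that wants to work with real characters has to decode them explicitly. A minimal sketch using generic CGI and Encode calls (not TWiki's actual code; 'cp1252' stands in for whatever charset the form page was served in):

    use CGI;
    use Encode qw(decode);

    my $q            = CGI->new;
    my $page_charset = 'cp1252';                # the charset the form page was served in

    my $raw   = $q->param('Currency');          # raw bytes, e.g. "\x80" for the Euro
    my $chars = decode( $page_charset, $raw );  # Perl characters, e.g. "\x{20AC}"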

Since the TWiki raw editor defines the encoding in the page and uses a textarea in a form, this means that the site character set appears to be used throughout.

XmlHttpRequest (XHR)

XHR doesn't do any decoding of the data you ask it to send back. Since JavaScript uses Unicode internally, if you use XHR from JavaScript to send the same form data as in the raw editor example, the browser will send it UTF-8 encoded, irrespective of the encoding specified for the web page. Thus our euro symbol gets sent as 0x20AC. It is left to the server to know that this 0x20AC should be mapped back to 0x80.

Fortunately there are only a few characters that need to be mapped this way. The table at http://www.alanwood.net/demos/ansi.html shows the mapping from Unicode to Windows-1252, which is one of the Microsoft ANSI character sets (though it varies from the ISO-8859-* standards). This character set overlaps considerably with iso-8859-1 but is not identical. Any server that uses iso-8859-1 has to know that these characters (128..159) will appear as Unicode code points in data submitted via XHR, and convert them back to the equivalent 8-bit code points if necessary.

Developers can find (and are welcome to reuse) code to manage this mapping in the WysiwygPlugin.
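
A hand-rolled sketch of that kind of mapping might look like the following. It is illustrative only (a handful of entries, made-up subroutine name) and is not the actual WysiwygPlugin code:

    # Map the Unicode code points that browsers send for characters which
    # iso-8859-1 / windows-1252 sites store in the 128..159 byte range.
    my %unicode_to_cp1252 = (
        0x20AC => 0x80,    # EURO SIGN
        0x2018 => 0x91,    # LEFT SINGLE QUOTATION MARK
        0x2019 => 0x92,    # RIGHT SINGLE QUOTATION MARK
        0x201C => 0x93,    # LEFT DOUBLE QUOTATION MARK
        0x201D => 0x94,    # RIGHT DOUBLE QUOTATION MARK
        0x2013 => 0x96,    # EN DASH
        0x2014 => 0x97,    # EM DASH
        # ... remaining entries omitted
    );

    sub map_high_chars {
        my ($text) = @_;    # a Perl character string, e.g. decoded XHR data
        $text =~ s/(.)/
            exists $unicode_to_cp1252{ ord $1 }
                ? chr( $unicode_to_cp1252{ ord $1 } )
                : $1
        /ges;
        return $text;
    }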

Security

Since UTF-8 provides another way to get data into the server, it opens the potential for new security holes. In particular, it's possible to construct 'overlong' encodings that are illegal UTF-8 but are sometimes interpreted as if they were the equivalent shorter encoding; the classic example is the two-byte sequence 0xC0 0xAF, an overlong encoding of '/', which lax decoders have accepted and which has been used to sneak path separators past security checks.

Perl

TWiki is written in Perl, and is therefore subject to Perl's support for encodings. Starting from Perl 5.6.0, Perl was able to handle Unicode characters requiring more than one byte, but Unicode support improved dramatically in Perl 5.8, which is really the first Perl version with usable Unicode.

Perl will silently do some magic, but there are limitations: When reading from an external source like a file or a socket, Perl has no knowledge of the encoding used in the data it receives. As long as Perl just moves the data around, this does not matter at all, because it is just shifting the encoded bytes. However problems can arise if:

  • Perl is forced to interpret data as one encoding when it is in fact encoded in a different encoding,
  • data using one encoding are compared with data using another encoding, or
  • Perl is asked to output multi-byte Unicode characters to an output stream that has not been set up for the correct encoding (Wide character in print is the typical warning - see the sketch below).
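
As a minimal illustration of the last point, the usual fix is to declare an encoding on the output stream (or to encode explicitly before printing):

    use Encode qw(encode);

    my $text = "price: \x{20AC}42";    # character string containing a code point above 255 (the Euro)

    binmode( STDOUT, ':encoding(UTF-8)' );    # tell Perl how to encode this stream
    print $text, "\n";                        # no "Wide character in print" warning

    # Alternatively, encode explicitly before printing to a raw byte stream:
    # print encode( 'iso-8859-15', $text );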

So, it is the job of the "environment" to know how data are encoded. Tough luck if all you have is a bunch of files on a disk: all you can do is make assumptions or intelligent guesses, for example by attempting a strict UTF-8 decode and seeing whether it succeeds (note that utf8::is_utf8 only reports Perl's internal flag on a string already in memory; it says nothing about bytes read from a file). TWiki generally assumes that all data is encoded using iso-8859-1, unless you define a site character set in configure.
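
A guessing strategy along those lines might look like this. It is a sketch only, and it assumes that anything which is not well-formed UTF-8 is in the configured site charset:

    use Encode qw(decode);

    sub guess_and_decode {
        my ( $bytes, $site_charset ) = @_;

        # Try a strict UTF-8 decode first; FB_CROAK makes it die on malformed input.
        my $copy  = $bytes;
        my $chars = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ) };
        return $chars if defined $chars;

        # Otherwise fall back to the site charset (iso-8859-1 by default).
        return decode( $site_charset || 'iso-8859-1', $bytes );
    }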

Perl has extensive (scary) documentation about its Unicode support in its perluniintro and perlunicode man pages. For future guidance for TWiki development, two documents shipped with 5.10 (but applicable to current versions of Perl) are helpful: a tutorial, perlunitut, and a FAQ, perlunifaq.

Perl also has a number of modules that can help with character set conversions.

  • Encode supports transforming strings between Perl's internal Unicode representation and the various byte encodings (though be warned that it doesn't handle remapping out-of-range characters, such as the Euro example discussed above).
  • HTML::Entities can be used to convert characters to a 7-bit representation (though users should be aware that decoding HTML entities always decodes back to Unicode). A short sketch of both modules follows.
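
A brief sketch of the two modules together; the exact entity form produced depends on the HTML::Entities version and arguments, so treat this as illustrative:

    use Encode         qw(decode);
    use HTML::Entities qw(encode_entities decode_entities);

    my $bytes = "\xE7\xBF\xBD";               # UTF-8 bytes for code point 32765
    my $chars = decode( 'UTF-8', $bytes );    # one Perl character, \x{7FFD}

    # Entity-encode everything outside printable ASCII: gives a 7-bit-safe string
    # containing a numeric entity for the character (e.g. "&#x7FFD;").
    my $ascii_safe = encode_entities( $chars, '^\x20-\x7E' );

    # Decoding entities always yields Unicode characters again, not site-charset bytes.
    my $back = decode_entities($ascii_safe);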

TWiki

TWiki assumes that:
  1. All topics (names and content) on a TWiki site will use the same encoding,
  2. Once you have selected an encoding, you will never change to a different encoding,
  3. If you don't select an encoding you are going to be happy with iso-8859-1,
  4. All tools used on topics have equivalent support for all encodings,
  5. The same encoding is used for storing topics on disk as for transport via HTTP.
These assumptions are inherent in the fact that TWiki uses global settings (in configure) to determine the encoding to use.

TWiki provides the encoding, as given by the configure setting, to browsers in both the Content-Type header and a meta element. However, it is up to the developer to correctly decode parameters to XHR calls.

{Site}{Locale}

A locale is a string that specifies a language (e.g. fr), a region (e.g. CA for Canada) and an associated character encoding. For example, 'fr_CA' specifies French in Canada, and 'ISO-8859-1' specifies the standard Western European 8-bit encoding (ASCII with the high bit characters representing accented characters and symbols). Together they form a locale, thus: 'fr_CA.ISO-8859-1'. The language part is used for sorting (a.k.a. collation order), e.g. for search results, where accented characters usually should sort close to their unaccented counterparts even though their numerical code points are quite different.
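
A small Perl sketch of locale-based collation; it assumes the fr_CA.ISO-8859-1 locale is actually installed on the server:

    use POSIX qw(setlocale LC_COLLATE);
    use locale;    # make cmp (and hence sort) honour LC_COLLATE

    setlocale( LC_COLLATE, 'fr_CA.ISO-8859-1' )
        or warn "locale fr_CA.ISO-8859-1 not available\n";

    # With the locale in effect, "été" sorts next to "etage" rather than after "zoo",
    # which is where plain byte-order comparison would put it (0xE9 > 'z').
    my @words  = ( 'zoo', "\xE9t\xE9", 'etage', 'abricot' );    # \xE9 is é in ISO-8859-1
    my @sorted = sort @words;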

TWiki lets you set a locale using {Site}{Locale}. If you don't set a locale, it uses whatever the default is for your perl installation.

You are strongly discouraged from setting this to utf-8 in current TWiki versions - see next section.

Locales should be largely irrelevant to TWiki's use of encodings. However, you still need to consider them because:

  • Locales are useful to specify national collation orders which vary depending on locale - UnicodeCollation is an alternative to be considered, but that is more work.
  • Locales have been found to create very interesting bugs when combined with utf8 mode in Perl, hence Richard decided to abandon combining locales with Perl's utf8 mode in earlier work on UnicodeSupport.

{Site}{CharSet}

This is the critical setting for deciding what encoding TWiki is going to use. For most sites, you are strongly discouraged from setting this to utf-8 when you first install TWiki (the only exceptions are sites using languages such as Japanese and Chinese where you don't care about WikiWords, or sites that must mix languages with different 8-bit character sets). In versions of TWiki before UseUTF8 is done (e.g. 4.2), Unicode is not properly supported and many things will break in return for the ability to use UTF-8. Note: if you must do this, specify utf-8 and not utf8, which breaks on Windows (RD: what does this mean - Windows client or server?). An example LocalSite.cfg fragment follows the notes below.
  • RD: Complete change above - utf-8 is NOT recommended for most current TWiki installations until UseUTF8 / UnicodeSupport is implemented. See InstallationWithI18N for more details.
  • RD: Actually this is intended to define only the browser charset (in original code) with the locale defining the charset/encoding that TWiki (and grep) uses internally on server, and of course they should match. I think this is still the case.
  • RD: This was originally derived from the locale if the CharSet was empty, and it would be much better if it had remained that way, as there's no point having this differ from the encoding in 99% of cases. I only created this setting in order to cover cases where the spelling of the encoding differs between the configured locales on the server ( locale -a output) and the browser - e.g. where it's utf8 on server and utf-8 in browser. Some such cases are covered in the code. This caused quite a lot of confusion, but in 99% of the cases, at least in the original Feb 2003 code, this setting was not needed - a later change broke the derivation of CharSet from locale setting. See InstallationWithI18N for some more detail here.
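
For reference, the two settings discussed above end up in lib/LocalSite.cfg (normally written by configure); the values here are illustrative only:

    # lib/LocalSite.cfg - fragment, illustrative values only
    $TWiki::cfg{Site}{Locale}  = 'fr_CA.ISO-8859-1';
    $TWiki::cfg{Site}{CharSet} = 'iso-8859-1';

    1;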

Issues, problems and directions

There are a number of major problems with encoding support in TWiki. It has to be borne in mind that the current I18N code was never intended to support Unicode, so we shouldn't be surprised that there are problems. In addition, lack of regression testing has resulted in some code rot over the years.
  1. TWiki is difficult to set up for a specific encoding. The guides are inadequate.
  2. The default encoding, iso-8859-1, was the right choice for TWiki in 2003; it's used by the vast majority of European languages, UTF-8 is much more complex, and there was no way we could go straight to Unicode, due to time available and the immature state of Unicode in Perl.
  3. It was never intended for there to be any conversion to/from character sets within TWiki (with the exception of the later EncodeURLsWithUTF8 feature), i.e. TWiki simply works in one character set for files, internal processing and web browser interactions, whether this is ISO-8859-1, KOI8-R (Russian) or EUC-JP (Japanese). However, later developers have tried to convert to and from site character sets, for reasons that are not clear but may include partial support for UTF-8 - this has resulted in a lot of broken code.
  4. TWiki doesn't remember the encoding used for a topic, which means that you can't simply trade topics between TWiki installations that might be using different character encodings.

    If you try to read a topic written on an ISO-8859-1 site on a site that uses UTF-8, TWiki (or rather Perl) may crash (actually, it just exits with an uncaught Perl error). If you try to use a client-side tool (such as a WYSIWYG editor) without telling it the correct encoding, the tool may crash (which is the tool's fault). A crash isn't guaranteed, because only some 8-bit byte sequences happen to form valid UTF-8, and this very pseudo-randomness has been the cause of many mysterious bugs.

    I think this is a non-problem although it would be nice to include the encoding in the topic metadata. Crashes aren't a big surprise given that it was never a goal to swap topic data between installations like this. TWiki was only intended to work in a single character set - if the TWiki administrator wants to swap topics they are expected to convert pathnames and file contents to the right character sets.

  5. TWiki support for character sets has been focused on using a single site character encoding since it was first introduced in Feb 2003, not on any conversion to/from other encodings. The more recent introduction of WYSIWYG editors such as TinyMCE, which work natively in Unicode, has somewhat broken this assumption. This is no problem if the site encoding can represent all of Unicode (although UTF-8 on the server is somewhat broken as mentioned), but the content will be silently mangled if it can't. Thus if you use Unicode characters in a TinyMCE session and try to save them on a site whose charset doesn't cover them, the characters may not be saved correctly.
  6. Plugin authors don't know (or don't care) about character encodings, and some plugins can damage encoded characters, mis-interpret encodings, or commit other such nasties. This won't normally happen if they operate exclusively via the Func interface, but if they get content from elsewhere, it can fail. This won't go away without some concerted effort on adopting InternationalisationGuidelines and helping plugin developers somehow. Use of the default [a-z] type regexes will break with Unicode just as much as with ISO-8859-1.

So, what can be done about all this? Well, there are a number of possible approaches:

  1. Only store HTML entities. HTML has its own encoding for characters that are outside standard ASCII. Integer entities such as &#32765; can be used to normalise all content to 7-bit ASCII, removing the requirement for a site character set.
    • ISO8859-1 plus Entities would be crash-proof, but require search patterns in many places to be entity-escaped where they are not today.
    • RD: This would break quite a few things such as server-side search engines that index the topic files directly - they may well support UTF-8 but not Entities.
  2. Fix the guides and the code. There are a lot of places where TWiki can break due to character set differences, but these can probably be tested for with no more than a doubling in the number of unit tests.
    • RD: Not a good option, makes everything messier.
  3. Standardise on UTF-8. If all topics were stored using a UTF-8 encoding, then all Unicode characters would be available all of the time.
    • Would need a (batch or silently behind the scenes?) conversion from legacy charsets to UTF-8 to avoid crashes - see the sketch after this list.
    • RD: Agree with this as the core approach, but we need to consider whether we provide backward compatibility at all. (See my comments on UnicodeProblemsAndSolutionCandidates)
  4. Record the encoding used in a topic, in meta-data, to allow topics to be moved.
    • This can only be done reliably for new topics.
    • RD: This is a good idea in addition to moving to UTF-8 as the default. If we are supporting backward compatibility i.e. still allow a native ISO-8859-15 site, then it would be very useful to have this in the metadata. Getting a canonical name for encoding may be an issue - check the IANA list.
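
As referenced in option 3 above, the core of such a batch conversion could simply re-read each topic file in the old charset and re-write it in UTF-8. A minimal sketch, with no locking, no filename conversion and no META:TOPICINFO bookkeeping (all of which a real migration script would need), and assuming 'data' is the TWiki data directory:

    use Encode qw(decode encode);
    use File::Find;

    my $old_charset = 'iso-8859-1';    # the site's current {Site}{CharSet}

    find( sub {
        return unless /\.txt$/ && -f $_;

        open my $in, '<:raw', $_ or die "read $_: $!";
        my $bytes = do { local $/; <$in> };
        close $in;

        my $chars = decode( $old_charset, $bytes );

        open my $out, '>:raw', $_ or die "write $_: $!";
        print $out encode( 'UTF-8', $chars );
        close $out;
    }, 'data' );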

AFAICT from looking at other wiki implementations, standardisation on UTF-8 is the most popular approach.

See http://develop.twiki.org/~twiki4/cgi-bin/view/LitterTray/TestEncodings for a script that should help developers fighting with these problems.

Resources

Please add any useful external links on this topic here - InternationalisationUTF8, UnicodeSupport and InternationalisationEnhancements have some good links.

  • link here

-- Contributors: CrawfordCurrie, HaraldJoerg, RichardDonkin

Discussion

I could not find a place where TWiki tries to convert from the site charset to UTF-8. Standardising on UTF-8 would seem like a good idea, but would need a (batch or silently behind the scenes?) conversion from legacy charsets to UTF-8 to avoid crashes. ISO8859-1 plus Entities would be crash-proof, but require search patterns in many places to be entity-escaped where they are not today. And finally, recording the encoding in a topic can only be done reliably for new topics.

-- HaraldJoerg - 09 Apr 2008

Good content. I added related links at the bottom. It might be better to update InternationalisationGuidelines instead of creating this topic?

-- PeterThoeny - 09 Apr 2008

It is the most pressing problem in TWiki. Many people have turned from TWiki to MediaWiki for exactly this reason. It seems to be the right choice to recode all existing topics to utf-8 and forget forever about storing topics in other encodings. The problem is that user names, WikiWords in utf8 and the internal links generated from them should be well tested beforehand and all the bugs eliminated. It is really hard work.

-- SergejZnamenskij - 09 Apr 2008

This started out as a blog post capturing my learning about encodings, but it turned into something more over time. InternationalisationGuidelines is a cookbook, and should be updated when we have decided how to handle encodings in the future. For now, I changed this to a brainstorming topic.

An on-the-fly conversion to UTF-8 would work by assuming the {Site}{CharSet} is the encoding for topics if the contents are not valid UTF-8 strings. There is an existing regex that appears to be designed to detect UTF-8 strings, $regex{validUtf8StringRegex}, though such a test may be expensive.

The trick is (I think) to make TWiki normalise all strings as soon as possible, so that internal strings are always perl strings, irrespective of the encoding used in the source. The store should take care of converting the encoding on topic content and topic names. Because the encoding always accompanies an HTTP request, it should be possible to decode URL parameters at point of entry too - if this isn't already done in CGI. Encoded byte strings should never be allowed to bleed into the core. Testing is then a case of throwing encoded strings at TWiki from all angles, and making sure they all get converted to perl strings correctly.

One thing that bugs me is that RichardDonkin has alluded to performance issues, I think related to non-ISO8859, and we have to understand these issues before making any decisions regarding encodings. There's also been the implication that UTF-8 support in perl is incomplete; again, I'm not clear if this is still the case.

-- CrawfordCurrie - 09 Apr 2008

The more I think about it, the more I am convinced that the only sensible approach is to use UTF-8 exclusively in the core. Accordingly I am raising UseUTF8 as a feature proposal.

-- CrawfordCurrie - 12 Apr 2008

I'm VERY glad that SergejZnamenskij voiced his opinion. While I strongly believe there are many who silently share it, I think it's a good moment for decision-makers to seriously consider the future of TWiki. I remember many months ago, many were afraid of UTF-8 simply because it might break existing setups. I urge these people to seriously consider and evaluate the implications of moving to utf-8 by setting up a test environment and then reporting issues.

Sure, TWiki internal is not 100% UTF-8, but we must start somewhere.

That said, much thanks to Crawford in embarking on this (very challenging) journey!

-- KwangErnLiew - 25 May 2008

I think migrating TWiki to UTF-8 support only is the right way to go.

It ensures TWiki will work in all languages. It gives ONE localization config to test.

But there is one TASK involved in this. If you take topics created in a locale like iso-8859-X, those topics will be pure garbage when viewed as utf-8 unless they are purely English.

So we need to design the right migration code from anything to utf-8.

You cannot search in a mix of utf-8 and non-utf-8 content - at least not in practice. This means that most TWiki applications that use formatted searches will break unless you do a total conversion of topics from non-utf-8 to utf-8.

Nothing prevents us from implementing something that does that. We have our configure in which we can build in service tasks, including walking through all topics and converting anything that is not utf-8.

It can be tricky to know whether something has already been converted. You do not want to convert the same file twice! But with the META:TOPICINFO format we can control this in a safe way, so a topic is only converted once and a conversion can be re-run when needed.

But it is the right thing to do. Everything in utf-8 starting from TWiki 5.0.

For 4.2.1, a decent utf-8 mode is the best we can get. We are nearly there with the great work done by Crawford.

Item5529 and Item5566 are the few bugs that are still not closed and that prevent us from claiming utf-8 support. Both seem to be a matter of Perl not knowing that specific variables contain utf-8 strings. Probably simple to fix but complicated to debug. Everyone can participate in analysing these two bugs.

-- KennethLavrsen - 26 May 2008

I wrote an outline plan in UseUTF8 some time ago. Might be an idea to focus specifics, such as online/offline/on-the-fly conversion, in that feature request.

-- CrawfordCurrie - 27 May 2008

Another interesting topic that I missed! Glad to see some links to InternationalisationEnhancements etc to avoid re-inventing the wheel.

I don't think it's a matter of closing a couple of bugs before we get UTF-8 support. I think UseUTF8 and my comments on UnicodeProblemsAndSolutionCandidates need some consideration before this is started, as they cover topics such as migration of topic/filename data.

I agree this should be done in a major version, and ideally on a separate branch if we are breaking backward compatibility with pre-Unicode character encodings such as ISO-8859-1.

Many changes above in various places, check the diffs. Windows-1252 is not quite the same as ISO-8859-1, in characters that are sometimes used - Google:demoronizer+iso has some links on this. Also fixed the bug where you said that ISO-8859-1 includes the Euro - this is only in -15.

Setting {Site}{Locale} to utf-8 on current TWiki versions only half works and is really only for sites that don't require WikiWords (e.g. Japanese, Chinese, etc) but no good for European languages that want accents to be included in WikiWords. So I don't agree with that recommendation above, and this also conflicts with the advice in InstallationWithI18N.

I also disagree with some of the issues analysis above - ISO-8859-* is and was a good choice for pre-Unicode encodings, since Perl in 2003 wasn't ready for TWiki to go UTF-8.

On performance - I saw a 3-times slowdown when I ran my own TWiki in Unicode mode (i.e. enabling Perl utf8 mode, not just handling utf8 as bytes), which is pretty large. I suggest we consider an ASCII-only mode for performance on English-only sites unless Perl has dramatically improved recently (or maybe hardware is faster these days, but on hosting sites CPU is still scarce and I don't think anyone wants a big slowdown...). This means some real optimisation after getting the basic UnicodeSupport going, focusing on the key bottlenecks for Unicode processing. Given that the system is processing 3 times as many bytes in some cases, it's reasonable to have some slowdown on those bytes, but it seems that Unicode makes Perl operations slower even on characters that are plain ASCII. Needs some investigation.

-- RichardDonkin - 14 Jun 2008

I refactored most of your comments into the text, to try and keep the flow of the document. Richard, you are the TWiki God of Encodings, so I for one take what you write as gospel. smile

-- CrawfordCurrie - 14 Jun 2008

I have added quite a lot more material above, including a section on URLs, and completely reversed your recommendation to use UTF-8 in current TWiki versions - this breaks many things and is a bad idea. It's only a good idea if you like a broken TWiki!

Also made quite a few additions above under {Site}{Locale} and {Site}{CharSet} to try and clear up a few things.

I now understand better where this is coming from, i.e. the fact that TinyMCE uses Unicode internally and so it's painful to convert to/from the site character set.

-- RichardDonkin - 15 Jun 2008

It's worse than that. Browsers use Unicode internally in the sense that JavaScript always uses Unicode. So this isn't a problem just for TinyMCE, it's a problem for all client-side applications as well. AJAX nicely sidesteps the problem because XML is UTF-8 encoded by default, but if you are not using AJAX - and most TWiki authors aren't - then it's a serious problem.

-- CrawfordCurrie - 15 Jun 2008

Good point about client-side apps generally; I was mostly thinking about server-side as in traditional TWiki, but clearly AJAX and so on are increasingly in demand.

On $regex{validUtf8StringRegex}: this is only used in EncodeURLsWithUTF8 (described in the URLs section above) to determine whether a string is truly valid UTF-8 (avoiding overlong UTF-8 encodings, which can be used in security exploits) or is in the site charset. It is not very efficient on a large amount of data but not too bad on a small amount - e.g. you might try searching for the first high-bit-set byte in a page, grabbing the next 50 bytes, and using that as a heuristic. Since UTF-8 has a very distinctive encoding it's turned out to be quite reliable, and it's based on what an IBM mainframe web server does - see InternationalisationUTF8 for background.

-- RichardDonkin - 15 Jun 2008

Don't know where else to place this: are all regular expressions used by TWiki Unicode regular expressions, or should they be?

-- FranzJosefGigler - 20 Oct 2008

Yes and no. The core regexes are compiled to reflect the site character set, so if you select UTF8 on your site, you will get unicode regexes. If we move to unicode in the core, then this complexity disappears.

  • RD: This is not true I'm afraid - although TWiki today can process 'UTF-8 as bytes', the regexes used by TWiki work in 'byte mode' only, and will not match UTF-8 characters, only their component bytes - so a regex that you might expect to match accented characters will not work at all, whereas it would work with ISO-8859-1 or some other 8-bit encoding. See the definitions at the top of UseUTF8 for more on 'byte mode' vs. 'character mode'.

-- CrawfordCurrie - 24 Oct 2008
