Bug: Non-Roman Alphabet characters in form field names removed
See Support.InternationalCharactersInFormFields for details. This affects Greek, Cyrillic and East Asian languages, where Roman alphabetic characters are rarely used. It would also corrupt any accented characters used in field names in European languages.
Test case
- Create fields called 'יי' and 'חיי' using TWikiForms and enter two different values
- Field names and values do not appear
See Support.InternationalCharactersInFormFields for a real-world example.
Environment
Any version of TWiki up to 12 Dec 2004.
Fix record
Fixed in SVN DEVELOP.
Note that there really needs to be a setting for 'this is a primarily non-alphabetic locale', so that non-alpha characters are removed only when using an alphabetic language (including Greek and Cyrillic, but not Japanese). With filtering disabled, any character could be used in form fields. See the TODO in SVNget:lib/TWiki/Form.pm.
-- RichardDonkin - 12 Dec 2004
Richard, simply commenting out $text =~ s/[^A-Za-z0-9_\.]//go; is likely to cause incompatibilities and could break TWikiApplications. TWiki stores each field name in two forms: the title and the name. Commenting out the filter makes the title and the name identical. Example field with a space in its title:
%META:FIELD{name="TopicClassification" title="Topic Classification" value=""}%
This should be reverted to the way it was before (not supporting I18N), or done properly with a filter based on the locale.
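To make the title/name distinction concrete, here is a minimal sketch of what the ASCII-only filter does. The helper name field_name_ascii is hypothetical and only wraps the substitution quoted above; it is not the actual code in lib/TWiki/Form.pm.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real TWiki::Form code): derive the stored
# "name" from a field title using the ASCII-only filter quoted above.
sub field_name_ascii {
    my ($title) = @_;
    ( my $name = $title ) =~ s/[^A-Za-z0-9_\.]//g;
    return $name;
}

# A title containing a space: the name drops the space, the title keeps it.
my $name = field_name_ascii('Topic Classification');

# A title made only of non-Roman letters (Hebrew, written as code points):
# every character is stripped, so the name ends up empty.
my $stripped = field_name_ascii("\x{5D7}\x{5D9}\x{5D9}");

printf "name=[%s] stripped=[%s]\n", $name, $stripped;
```

With the filter in place the first field gets the name TopicClassification, while the second gets an empty name, which fits the test case above where the field does not appear.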
-- PeterThoeny - 13 Dec 2004
I'll have a look at this - what needs to happen depends on the type of language involved:
- Alphabetic languages including Greek and Cyrillic - allow only alphabetic characters, same behaviour as now but works for Greek and Cyrillic as well as Roman languages
- All other languages (e.g. Japanese) - these typically don't have a concept of WikiWord so it's not a problem that the field name is the same as the title IMO (e.g. could be two Japanese words/characters).
Until we have full Unicode support for sites using Perl 5.8, we can't do much about the second case, so this would just have name = title, which I think is OK.
I'm not sure why it's useful to set the title (human-readable version) and name (cleaned-up version) to different values when using non-alphabetic languages. After all, the name is just intended to look up the field in the form definition (ref: TWikiMetaData spec). If you are using Japanese, you can't create WikiWords anyway and there won't be any spaces in either the name or the title.
As we move to support East Asian languages such as Japanese and Chinese better, through Unicode, we won't be able to use locales anyway, and matching based on valid 'letter' characters becomes language dependent. For example, to know whether a form field name consists of valid Japanese characters, you would have to have it marked as Japanese (or perhaps mark the whole page as Japanese). This language marking is fairly painful, so I haven't seen many applications do it, though it is important for displaying Japanese and Chinese characters properly. It may be enough to mark the whole site as one language (already supported through the %LANG% variable) but allow characters from other languages anyway. The exception would be Japanese strings in a Chinese site (say), which would need to be properly marked as Japanese.
The summary is that matching on locale is not as easy as it looks when using Unicode, so as far as possible we should do script/language-independent matching, perhaps using [\p{Letter}\p{Mark}] to match letters and combining characters (accents and so on, where written as separate Unicode characters). This should cover every script's concept of letters, including East Asian languages.
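As a rough sketch of that idea, the filter could keep letters and combining marks from any script, plus the digits, underscore and dot already allowed. The helper name field_name_unicode is hypothetical, and the sketch assumes Perl 5.8 or later with the title already decoded into Perl characters (not raw UTF-8 bytes).

```perl
use strict;
use warnings;

# Hypothetical helper: script-independent version of the field name filter.
# Assumes $title is a decoded character string, not a UTF-8 byte string.
sub field_name_unicode {
    my ($title) = @_;
    # Remove everything except letters, combining marks, digits, '_' and '.'
    ( my $name = $title ) =~ s/[^\p{Letter}\p{Mark}0-9_.]//g;
    return $name;
}

# Spaces and punctuation are still removed from a Roman-alphabet title.
my $roman = field_name_unicode('Topic Classification!');

# Hebrew letters (U+05D7 U+05D9 U+05D9) survive unchanged.
my $hebrew = field_name_unicode("\x{5D7}\x{5D9}\x{5D9}");

# 'e' followed by a combining acute accent (U+0301) keeps its mark.
my $accented = field_name_unicode("caf\x{65}\x{301}");
```

This keeps the existing behaviour for titles such as 'Topic Classification' while leaving Greek, Cyrillic, Hebrew or East Asian titles intact, at the cost of name and title being identical for scripts that do not use spaces.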
-- RichardDonkin - 13 Dec 2004