Question
When I search a topic for a word containing an uppercase umlaut, using a correct fragment of the word (including the umlaut) as the search pattern, I don't get the expected result with:
- Simple (standard) search, case-insensitive, without regular expressions (casesensitive=off, regex=off)
I do get a correct result with:
- Simple search, case-sensitive, with the search pattern written in the exact upper/lower case (casesensitive=on, regex=off)
- Regex search, both with and without case sensitivity
So it seems to me that the standard search has trouble ignoring case when the search pattern contains umlauts.
How is the simple search implemented? How can I narrow the problem down?
My configuration is attached, as requested by the Support Guidelines.
Best regards
AndreasMock
Environment
--
AndreasMock - 14 Mar 2007
Answer
Thanks for attaching your config, which made it easy to diagnose this.
The problem is that you are using a locale of de_DE.utf8, and UTF-8 is not supported in TWiki yet (see UnicodeSupport). Instead, just use de_DE.ISO-8859-1 everywhere and you will find that searching for umlauts in upper or lower case works fine, thanks to the InternationalisationEnhancements.
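To see why the encoding matters here, it can help to look at the raw bytes. The sketch below is only an illustration: hex_bytes is a made-up helper name, and it is an assumption that the case-insensitive search trips over exactly this difference. In ISO-8859-1 an umlaut is a single byte, so the locale's case mapping can pair Ä with ä; in UTF-8 each is a two-byte sequence.

```shell
# hex_bytes ENCODING: read UTF-8 text on stdin and print its bytes,
# re-encoded as ENCODING, in hex. (Hypothetical helper for illustration.)
hex_bytes() {
    iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

# \303\204 is "Ä" and \303\244 is "ä", written as UTF-8 octal escapes
# so that the script itself does not depend on the terminal encoding.
printf '\303\204' | hex_bytes ISO-8859-1   # c4
printf '\303\244' | hex_bytes ISO-8859-1   # e4
printf '\303\204' | hex_bytes UTF-8        # c384
printf '\303\244' | hex_bytes UTF-8        # c3a4
```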
For installation with locales, see I18N which links to the I18N installation document.
You may find that you need to generate the locale on your box (required on Ubuntu for ISO-8859-* locales), or it may be pre-installed. Try locale -a to check what's already there.
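A rough sketch of that check, assuming a glibc-based system: locale -a usually prints normalized names such as de_DE.iso88591, so the comparison below ignores case and hyphens. locale_listed is a hypothetical helper name, and the localedef line is only printed as a suggestion (it needs root, and your distribution may prefer its own tool).

```shell
# locale_listed NAME: read a `locale -a` listing on stdin and succeed if
# NAME is in it, ignoring case and hyphens (de_DE.ISO-8859-1 matches
# de_DE.iso88591). Hypothetical helper for illustration.
locale_listed() {
    want=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -d '-')
    tr 'A-Z' 'a-z' | tr -d '-' | grep -xF -- "$want" >/dev/null
}

if locale -a | locale_listed de_DE.ISO-8859-1; then
    echo "de_DE.ISO-8859-1 is already available"
else
    echo "de_DE.ISO-8859-1 is missing; as root, something like this may build it:"
    echo "  localedef -i de_DE -f ISO-8859-1 de_DE.ISO-8859-1"
fi
```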
Any pages containing non-ASCII data will need to be migrated since you are changing locale. I suggest you do a full backup of your data and pub directories before making this change.
To convert filenames, see convmv (http://j3e.de/linux/convmv/) or this tool (http://qa.mandriva.com/twiki/bin/view/Main/MandrivaLinux2007Errata#UTF8_issue_when_reinstalling_and); convmv looks better supported.
To convert the data, see iconv. A handy FAQ from Novell is here (http://www.novell.com/coolsolutions/qna/1786.html), but note that the direction of conversion is the opposite of what you need. Some sort of find . -print | while read x; do iconv stuff ; done shell script would be useful.
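A minimal sketch of such a find/iconv loop, assuming the topic files are currently UTF-8 and should become ISO-8859-1. convert_tree is a hypothetical name, the script only touches *.txt files (the RCS history files ending in ,v are not handled), and it is best tried on a copy of the data directory first.

```shell
# convert_tree DIR: re-encode every *.txt file below DIR from UTF-8 to
# ISO-8859-1 in place. Hypothetical helper; run it on a backup copy first.
convert_tree() {
    find "$1" -type f -name '*.txt' -print | while IFS= read -r f; do
        # iconv exits non-zero if a character has no ISO-8859-1 equivalent
        # or the input is not valid UTF-8; the original is then left untouched.
        if iconv -f UTF-8 -t ISO-8859-1 "$f" > "$f.tmp"; then
            mv "$f.tmp" "$f"
        else
            rm -f "$f.tmp"
            echo "skipped: $f" >&2
        fi
    done
}
```

It would be called as, for example, convert_tree /path/to/copy/of/data, where the path is a placeholder for your own backup copy.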
--
RichardDonkin - 14 Mar 2007
Hi Richard,
thank you for your fast answer. I have read many topics in the meantime, but it's really hard to put all the parts together. That's my first impression after about 4 days of intensive work trying to install TWiki properly.
I ran 'locale -a' as stated in the installation document and found that I only have utf8-like locales. That's why I chose one. I also fell straight into a pitfall of version 4.1.2 by using the locale 'de_DE.utf8' without setting the charset (Installation Guide: I must not do that!).
It would probably be helpful to state explicitly that you currently have to use a single-byte encoding, and that you have to create an appropriate locale if 'locale -a' does not list one.
My problem is that the information needed to get even a vague picture of TWiki is spread over so many documents. In addition, I never know whether a topic is still valid for the current state of development (e.g. when the topic was created many years ago).
Your reply summarized what I hadn't understood so far (even while reading it). Thank you for that.
--
AndreasMock - 14 Mar 2007
The more I reread the mentioned topics, the less I understand why I didn't realize that I should NOT use UTF-8. (Excuse: you see the difference between reading a text with and without background knowledge. ;-))
Sorry for bothering everybody.
--
AndreasMock - 14 Mar 2007
I agree about the docs being of variable quality, but the official docs and the SupplementalDocuments are pretty good generally. The TWiki community really does need to retire old TWiki topics, as there are many topics from years ago that should be archived into another web to avoid confusion.
One question: can you provide the commands on SUSE to generate a locale? From a few web searches it seemed as if the glibc-locale package included a wide range of locales.
I took the opportunity to update InstallationWithI18N to make the 'don't use UTF-8' part much clearer!
--
RichardDonkin - 14 Mar 2007
Hi Richard,
sorry for answering so late. I didn't see your question.
I can't answer your question, because I did the following: I just use the locale de_DE.iso-8859-1, even though it is not listed by locale -a. I first tried de_DE.iso-8859-15, but suddenly got a Perl error stating that this conversion couldn't be done.
I really don't understand the following:
1) locale -a shows a bunch of locales. Some of them have the form de_DE, without an explicit encoding. What does that mean?
2) locale -m shows many character mappings, but not every character mapping appears in the output of locale -a. How do the two fit together?
3) Why can I use de_DE.iso-8859-1 out of the box but not de_DE.iso-8859-15?
You see, I have to investigate the whole locale stuff a little bit more. Good documentation for this topic seems hard to find.
Best regards Andreas Mock
--
AndreasMock - 20 Mar 2007
Hi there, I just stumbled over (seemingly) the same problems with umlauts when trying to install TWiki on SuSE 10.1.
1) de_DE.iso-8859-15 is not valid as {Site}{Locale} here. Browsing through some SuSE forums, one finds that using de_DE@euro for the {Locale} and iso-8859-15 for the {Charset} should work, but it does not really.
2) Or one can try de_DE.iso-8859-1 and iso-8859-1, with the very same result.
What seems to happen: even if you set the TWiki locale to ISO-8859, Apache still emits UTF-8. Everything only looks OK when you set your browser to UTF-8, but the web page is not tagged correctly, so you need to switch the browser to UTF-8 manually for correct display.
My explanation: it seems that since 9.x, SuSE Linux is almost completely (too much) UTF-8 by default, even if you try to set a different locale.
Solutions (in the order I found them):
I) Dumb solution: change the system-wide character set. See http://en.opensuse.org/Change_system_wide_character_set
II) Smart solution: force only the default Apache character set to ISO-8859-15 (or whatever you like). Add
AddDefaultCharset ISO-8859-15
at the end of
/etc/apache2/mod_mime-defaults.conf
(or adjust that line if it is still there; 10.x no longer seems to have it). Together with the {Locale} and {Charset} from 1) above, everything works like a charm! The emitted web page is now correctly tagged and detected as ISO-8859-15 (Latin9), and all umlauts are there! (Win IE6)
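One way to confirm how the page is tagged, without relying on the browser's guess, is to look at the Content-Type response header. This is only a sketch: charset_of_headers is a hypothetical helper, and the URL in the usage comment is a placeholder for your own TWiki view URL.

```shell
# charset_of_headers: read HTTP response headers on stdin and print the
# charset parameter of the Content-Type header, if there is one.
# Hypothetical helper for illustration.
charset_of_headers() {
    tr -d '\r' |
        sed -n 's/^[Cc]ontent-[Tt]ype:.*charset=\([^; ]*\).*$/\1/p' |
        head -n 1
}

# Usage (placeholder URL, adjust to your installation):
# curl -sI http://localhost/twiki/bin/view/Main/WebHome | charset_of_headers
```

If this prints nothing, the server did not announce a charset in the header, and the browser falls back to other hints or its own default.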
Hope this helps!
(Sorry for the bad formatting, I'm very new to TWiki)
--
JoachimWesner - 23 Apr 2007