
Question

When I search for a word with uppercase umlauts in a topic using a correct fragment of this word including that umlaut as search pattern, I don't get an appropriate result:

  • Simple (standard) search, case-insensitive, without regular expressions (casesensitive=off, regex=off)

I do get a correct result:

  • Simple search, case sensitive with exact upper-/lowercase writing of search pattern (casesensitive=on, regex=off)
  • Regex-Search, both with and without case sensitivity.

So, it seems to me that the standard search has trouble with ignoring case with umlauts in the search pattern.

How is the simple search implemented? Can I break down the problem?

My configuration is attached, as requested by the Support Guidelines.

Best regards AndreasMock

Environment

TWiki version: TWikiRelease04x01x02
TWiki plugins: DefaultPlugin, EmptyPlugin, InterwikiPlugin
Server OS: SuSE Enterprise Linux 9 SP3
Web server: Apache 2.0.49
Perl version: perl 5.8.3
Client OS: Windows 2000 SP4+
Web Browser: IE 6, FF 2
Categories: Search, Localisation

-- AndreasMock - 14 Mar 2007

Answer


Thanks for attaching your config, which made it easy to diagnose this.

The problem is that you are using a locale of de_DE.utf8, and UTF-8 is not supported in TWiki yet (see UnicodeSupport). Instead, just use de_DE.ISO-8859-1 everywhere and you will find that searching for umlauts in upper or lower case works fine, thanks to the InternationalisationEnhancements.
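As a rough illustration of why case-insensitive matching depends on the locale, here is a minimal sketch. It assumes the search ultimately relies on a locale-aware tool such as grep, which I have not confirmed for this setup, and the helper name count_ci_matches is made up for the example. In the C locale, grep -i folds ASCII letters, but it does not know that the two UTF-8 byte sequences for Ä and ä are the same letter.

```shell
# Count the lines of TEXT that match PATTERN case-insensitively in the
# C locale, where grep compares plain bytes and folds only ASCII letters.
# (count_ci_matches is a hypothetical helper, not part of TWiki.)
count_ci_matches() {
    pattern=$1
    text=$2
    # grep -c prints the count; "|| true" hides its exit status 1 on "0".
    printf '%s\n' "$text" | LC_ALL=C grep -ci -e "$pattern" || true
}

# "Ärger" and "ärger" spelled out as UTF-8 bytes (octal escapes).
upper=$(printf '\303\204rger')
lower=$(printf '\303\244rger')

count_ci_matches "RGER" "$upper"     # ASCII letters only: case is folded
count_ci_matches "$lower" "$upper"   # umlaut differs in case: bytes differ
```

The first call should print 1 and the second 0. With a single-byte locale such as de_DE.ISO-8859-1 active, the C library knows how to fold the umlaut, which is what the suggested change relies on.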

For installation with locales, see I18N which links to the I18N installation document.

You may find that you need to generate the locale on your box (required with Ubuntu for ISO-8859-* locales), or it may be pre-installed - try locale -a to check what's already there.
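A small sketch of that check, assuming a glibc-based system where locale -a lists codesets in normalised form (for example iso88591 rather than ISO-8859-1). The function names are made up for this example, and the localedef line is only printed as a suggestion, not run; the exact procedure can differ between distributions.

```shell
# Print a locale name the way glibc's `locale -a` usually lists it:
# the codeset is lowercased and stripped of "-" and "_"
# (de_DE.ISO-8859-1 -> de_DE.iso88591). Names without a codeset pass through.
normalize_locale() {
    case $1 in
        *.*)
            lang=${1%%.*}
            codeset=$(printf '%s' "${1#*.}" | tr 'A-Z' 'a-z' | tr -d '_-')
            printf '%s.%s\n' "$lang" "$codeset"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

# Succeed if the given locale appears in the output of `locale -a`.
locale_installed() {
    locale -a 2>/dev/null | grep -qxF -e "$(normalize_locale "$1")"
}

if locale_installed de_DE.ISO-8859-1; then
    echo "de_DE.ISO-8859-1 is installed"
else
    echo "de_DE.ISO-8859-1 is missing; on glibc systems, try as root:"
    echo "  localedef -i de_DE -f ISO-8859-1 de_DE.ISO-8859-1"
fi
```

On Ubuntu the usual route is locale-gen instead of calling localedef directly.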

Any pages with non-ASCII data within them will need to be migrated since you are changing locale - I suggest you do a full backup of your data and pub directories before making this change. To convert filenames, see convmv (http://j3e.de/linux/convmv/) or the tool described at http://qa.mandriva.com/twiki/bin/view/Main/MandrivaLinux2007Errata#UTF8_issue_when_reinstalling_and (convmv looks better supported).

To convert the data, see iconv - a handy FAQ from Novell is at http://www.novell.com/coolsolutions/qna/1786.html, but note that the direction of conversion there is the opposite of what you need. Some sort of find . -print | while read x; do iconv stuff ; done shell script would be useful.
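Something along these lines might do as a starting point. It is a sketch only: the function name convert_topics is made up, it assumes the topics are currently UTF-8 and should become ISO-8859-1, and it only touches *.txt files, so the RCS history files (*.txt,v) and anything under pub would need separate handling. Take the backup first.

```shell
# Convert every topic file (*.txt) below a directory from UTF-8 to
# ISO-8859-1, in place. (convert_topics is a hypothetical helper.)
convert_topics() {
    dir=$1
    find "$dir" -type f -name '*.txt' -print | while IFS= read -r f; do
        # Convert into a temporary file first, and overwrite the original
        # only if iconv succeeded, so a failed run never truncates a topic.
        if iconv -f UTF-8 -t ISO-8859-1 "$f" > "$f.iconv.tmp"; then
            # cat into the existing file to keep its owner and permissions.
            cat "$f.iconv.tmp" > "$f"
        else
            echo "could not convert: $f" >&2
        fi
        rm -f "$f.iconv.tmp"
    done
}
```

It would be called with the path to the TWiki data directory, for example convert_topics /path/to/twiki/data. Files containing characters that do not exist in ISO-8859-1 are reported and left unchanged.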

-- RichardDonkin - 14 Mar 2007

Hi Richard,

thank you for your fast answer. I have read many topics in the meantime, but it is really hard to put all the parts together. That is my first impression after about 4 days of intensive work trying to install TWiki properly.

I did a 'locale -a' as stated by the installation document and found out that I only have utf8-like locales. That is why I chose one. I also fell straight into a pitfall of version 4.1.2 by using the locale 'de_DE.utf8' without setting the charset (the Installation Guide says I must not do that!).

It would probably be helpful to give explicit advice that you have to use a single-byte encoding at the moment, and that you have to create an appropriate locale if it doesn't show up when you run 'locale -a'.

My problem is that the information needed to get even a vague picture of TWiki is spread over so many documents. In addition, I never know whether a topic is still valid for the current state of development (e.g. the topic was created many years ago).

Your reply summarized what I didn't understand so far (even while reading it). Thank you for that.

-- AndreasMock - 14 Mar 2007

The more I re-read the mentioned topics, the less I understand why I didn't realize that I should NOT use UTF-8. (Excuse: you see the difference between reading a text without background and with background. ;-))

Sorry for bothering anybody.

-- AndreasMock - 14 Mar 2007

I agree about the docs being of variable quality, but the official docs and the SupplementalDocuments are pretty good generally. The TWiki community does really need to retire old TWiki topics, as there are many topics from years ago that should be archived into another web to avoid confusion.

One question: can you provide the commands on SUSE to generate a locale? From a few web searches it seemed as if the glibc-locale package included a wide range of locales.

I took the opportunity to update InstallationWithI18N to make the 'don't use UTF-8' part much clearer!

-- RichardDonkin - 14 Mar 2007

Hi Richard,

sorry for answering so late. I didn't see your question.

I can't answer your question, because I did the following: I just use the locale de_DE.iso-8859-1 even though it is not listed by locale -a. I first tried de_DE.iso-8859-15 but suddenly got a Perl error stating that this conversion couldn't be done.

I really don't understand the following:

1) locale -a shows a bunch of locales. Some of them are locales of the form de_DE without an explicit encoding. What does that mean?

2) locale -m shows many character mappings. But not every character mapping is shown in the output of locale -a. How do those two parts fit together?

3) Why can I use de_DE.iso-8859-1 out of the box but not de_DE.iso-8859-15?

You see, I have to investigate the whole locale stuff a little bit more. Good documentation for this topic seems hard to find.

Best regards Andreas Mock

-- AndreasMock - 20 Mar 2007

Hi there, I just stumbled over (seemingly) the same problems with umlauts when trying to install TWiki on SuSE 10.1.

1) de_DE.iso-8859-15 is not valid as {Site}{Locale} here. Browsing through some SuSE forums, one finds that using de_DE@euro for the {Locale} and iso-8859-15 for the {Charset} should work, but it does not really.

2) Or, one can try de_DE.iso-8859-1 and iso-8859-1 with the very same result.

What seems to happen: even if you set the TWiki locale to ISO-8859, Apache still emits UTF-8. Everything only looks OK when you set your browser to UTF-8, but the web page is not tagged correctly, so you need to set the browser manually to UTF-8 for correct display.

My explanation: since 9.x, SuSE Linux seems to be almost completely (too much) UTF-8 by default, even if you try to set a different locale.

Solutions (in the order I found out):

I) Dumb solution: change the system-wide character set, see http://en.opensuse.org/Change_system_wide_character_set

II) Smart solution: only force the default Apache character set to ISO-8859-15 (or whatever you like). Add

AddDefaultCharset ISO-8859-15

at the end of

/etc/apache2/mod_mime-defaults.conf

(or adjust that line if it is still there; 10.x seems not to have it any longer)

together with the {Locale} and {Charset} of 1) above, and everything works like a charm! The emitted web page is now correctly tagged and detected as ISO-8859-15 (Latin9), and all umlauts are there! (Win IE6)
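To check what Apache actually declares without relying on the browser's guess, one can look at the Content-Type response header. Below is a small sketch; the function name charset_of is made up, and the example feeds in canned headers, so the only real input needed is the header output of a request to any TWiki page.

```shell
# Read HTTP response headers on stdin and print the charset declared in
# the Content-Type header, lowercased (prints nothing if none is declared).
# (charset_of is a hypothetical helper for this example.)
charset_of() {
    tr -d '\r"' |
        sed -n 's/^[Cc]ontent-[Tt]ype:.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\([^; ]*\).*$/\1/p' |
        head -n 1 |
        tr 'A-Z' 'a-z'
}

# Example with canned headers; against a live server one would pipe in
# the output of `curl -sI <URL of any TWiki page>` instead.
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-15\r\n\r\n' | charset_of
```

If this prints utf-8 while {Charset} is set to iso-8859-15, the page is being tagged with the wrong encoding, which matches the symptom described above.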

Hope this helps!

(Sorry for the bad formatting, I'm very new to TWiki)

-- JoachimWesner - 23 Apr 2007

Topic attachments
TWiki_Configuration.htm (r1, 195.5 K, 2007-03-14 10:59, AndreasMock): Current configuration of version 4.1.2
Topic revision: r6 - 2007-04-24 - JoachimWesner
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.