
Keyword Search

KeywordSearchWithImplicitAnd has been implemented and released with Cairo (TWiki 3). Now it appears that the notion of "keyword" has led to confusion and possibly a wrong implementation. This topic seeks to list possible solutions.

The origin of the bug report was the observation that, with keyword search, TWiki finds a page containing the word variable when searching for the keyword RIA.

Originally, KeywordSearchWithImplicitAnd was created because "TWiki's search should behave like Google and other modern search engines".

The current implementation appears to be:

  • Enter keywords (= separate words)
  • Find words that contain these keywords
  • For example: ria finds variable

Whereas Google's implementation is:

  • Enter keywords (= separate words)
  • Find words that are these keywords
  • For example: ria finds ria and RIA but not variable
  • Finds expanded acronyms. For example: ria finds Rock Island Arsenal, Rich Internet Applications, and more
  • Finds full-word parts of hyphenated phrases. For example, pro finds pro-choice (and Public Record Office, for that matter)
  • Finds wikiwords when all parts are given. For example Mac pro finds MacPro.
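The difference between the two behaviours listed above can be sketched in a few lines of Perl. This is only an illustration, not TWiki's search code: the function names and sample texts are made up for the example, and Perl's \b is assumed as a stand-in for "word boundary".

```perl
use strict;
use warnings;

# Substring match: the keyword may appear anywhere inside a word
# (the behaviour reported for the current keyword search).
sub contains_keyword {
    my ( $text, $keyword ) = @_;
    return $text =~ /\Q$keyword\E/i ? 1 : 0;
}

# Whole-word match: the keyword must be delimited by word boundaries
# (closer to what the Google comparison describes).
sub contains_word {
    my ( $text, $keyword ) = @_;
    return $text =~ /\b\Q$keyword\E\b/i ? 1 : 0;
}

# Hypothetical sample texts built around the words used in the report.
for my $text ( 'Set this variable first', 'RIA frameworks' ) {
    printf "%-25s substring=%d word=%d\n", $text,
      contains_keyword( $text, 'ria' ), contains_word( $text, 'ria' );
}
```

With these definitions, ria is found inside variable only by the substring version, while both versions find RIA.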

-- Contributors: ArthurClemens, MeredithLesly

Specification

  1. (Peter's suggestion) Add type="word" to SEARCH
  2. Document this option in TWiki.TWikiPreferences at SEARCHDEFAULTTTYPE
  3. Add documentation to TWiki.VarSEARCH

Following the Implementation in progress (see below),

  • type="word" really means: type="keyword" while using word boundaries
  • when word is passed:
    • internally (in Search.pm):
    • change type from word to keyword
    • set (new) wordBoundaries to 1
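As a rough sketch of the mapping described above (not the actual Search.pm code), the translation from the user-facing type to the internal pair could look like this. The helper name normalise_search_type is hypothetical, and falling back to literal when no type is given is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: maps the user-facing type to the internal
# (type, wordboundaries) pair described in the list above.
sub normalise_search_type {
    my (%params) = @_;
    my $type = $params{type} || 'literal';    # assumption: literal is the fallback
    my $wordBoundaries = 0;
    if ( $type eq 'word' ) {
        $type           = 'keyword';          # "word" is keyword search ...
        $wordBoundaries = 1;                  # ... restricted to whole words
    }
    return ( type => $type, wordboundaries => $wordBoundaries );
}

my %opts = normalise_search_type( type => 'word' );
print "$opts{type} $opts{wordboundaries}\n";
```

Existing type="keyword" searches pass through unchanged, which is the backwards-compatibility point of the proposal.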

More detailed in Implementation...

Implementation

  • the option wordboundaries is passed to the search implementations (Store::searchInWebContent leads to a call to search on the chosen algorithm {SearchAlgorithm})
  • the search implementations can be used for either keyword or word
    1. When using scope="topic", search type is actually not used because a Perl search on topic names is done. This is actually good for WebSearch because no extra intelligence has to be built in to distinguish between searching on title and searching on text.
    2. In Forking.pm regex search must be used using EgrepCmd:
      if ( $options->{type} eq 'regex' || $options->{wordboundaries} == 1 ) { $program = $TWiki::cfg{RCS}{EgrepCmd}; }
    3. and a few lines below, add the word boundary check to the search string (actually now one word at a time is searched, but will that always be the case? Update: no, quoted search strings are searched as-is, but that does not change the word boundaries):
      $searchString =~ s/^(.*)$/\\b$1\\b/go if $options->{'wordboundaries'};
    4. The same line goes into PurePerl.pm and NativeSearch.pm
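To make the effect of the wordboundaries option concrete, here is a minimal stand-alone sketch of a pure-Perl matcher. It is not the code from PurePerl.pm: the function text_matches is hypothetical, and escaping each token with quotemeta is an assumption made so that the example is safe for arbitrary input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a search backend: returns 1 when every
# token is found in the text (implicit AND), 0 otherwise.
sub text_matches {
    my ( $text, $tokens, $options ) = @_;
    for my $token (@$tokens) {
        my $pattern = quotemeta($token);

        # Same idea as the substitution above: wrap the token in \b ... \b.
        $pattern = "\\b$pattern\\b" if $options->{wordboundaries};
        return 0 unless $text =~ /$pattern/i;
    }
    return 1;
}

my $text = 'Set the variable before calling the web service';
print text_matches( $text, ['ria'], { wordboundaries => 0 } ), "\n";    # substring search
print text_matches( $text, ['ria'], { wordboundaries => 1 } ), "\n";    # whole words only
print text_matches( $text, ['web service'], { wordboundaries => 1 } ), "\n";
```

A quoted phrase such as "web service" is treated as one token here, so the boundaries apply to the start and end of the whole phrase.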

Bug entry: Bugs:Item2123

-- ArthurClemens - 02 May 2007

Discussion

In my topic I have defined a search for RIA. But the search also finds topics with variable. This shouldn't be with type=keyword.

AC


That is a question how to define "keyword". For TWiki it is any set of number of characters, separated by space. A number of characters is anywhere in a word, not just at word boundaries.

This is an enhancement IMHO. Could be done with a new type=word to keep it compatible.

Renamed summary from "Search finds topics it should not find" to "Add new type=word search to search on word boundaries"

-- PTh


The Rest Of The World defines "keywords" as "words", not as word parts or a collection of characters. If TWiki searches anywhere in a word this is a bug.

I always thought "literal search" was for finding a series of characters.

Changing priority back to normal (bug).

AC


It was agreed at the release meeting May 2nd that this is an enhancement and that your proposed change will be non-backwards compatible and break many twikiapps if implemented as you suggest. We cannot just change the way search works. It WILL break twikiapps!

Peter's change was agreed to be a backwards compatible solution for those that want to make a twiki app that only finds words.

Changing back to enhancement

KJL


That's a shame really. So the first implementation of "keyword" search was wrong. We can create a new word type search, but this will increase confusion. The documentation should at least be clear on this.

AC


TWiki:Codev.KeywordSearchWithImplicitAnd says about keyword search:
Add a new "keyword" search type besides the existing "literal" search and "regex" search. Users expect keyword search, e.g. TWiki's search should behave like Google and other modern search engines

Words and literal text are delimited by space.

This seems very clear. Changed status back to bug.

AC

And I change it back to enhancement.

Since "the old romans" TWiki's search feature has worked the way it works today.

There are 10000s of TWiki Applications out there using search. Search is one of the most commonly used features. If we change the search so it suddenly looks for whole words only instead of parts of words, it will break quite a few of those TWiki Applications. We are not talking about a bug here where something never worked for anyone and we are fixing it. Searching for strings that can be part of a word is a fully valid feature and people - including myself - have been using it for years. Searching for whole words is a missing feature that should be added.

If we want to be able to search for whole words only - we have to enhance the search feature with a new keyword. Not change it.

We will screw up TWiki Applications for our current users if we change the behaviour.

Naturally the documentation needs to be clear on all of this. It is sad that the current keywords may be confusing, but fixing it breaks things. And I must admit that keyword could be misunderstood. But what better word is there that is a common word and not a programmer nerd word? English does not have enough words to always be able to specifically describe something in ONE word. But if we add a type="word" and both type="keyword" and type="word" are documented properly I think we can contain the confusion to a reasonable level. People need to look at the SEARCH doc anyway to know what to add in a SEARCH tag.

I do not want to start a Priority field war. I just want to make sure no one "fixes" the current behaviour when seeing this bug report, but properly understands that it is about adding an additional feature to the search to maintain backwards compatibility. Requirement does not mean low priority.

KJL

This 'feature' has been introduced in Cairo, and it has been implemented differently from the documentation. 10000s of search applications seems to be a very high estimate.

So we now have "keyword search" and "literal search" that both do the same thing. Or?

Peter writes about keyword: For TWiki it is any set of number of characters, separated by space. That definition seems ok, as long as it means separated by space on both sides. But clearly TWiki keyword search doesn't work this way, because it finds variable when searching for RIA.

Shouldn't this be repaired?

If people are using "keyword search" when searching for word parts, they are not following the search docs. In fact, if they would follow the doc, they would use 'literal search', not keyword search.

The default search type is "literal search" (defined in %SEARCHVARDEFAULTTYPE%). So people have deliberately set a search to keyword search. And for these people, their apps will be very easy to update.

AC

TWiki has been installed at least 1000 places. There are at least 10 topics with a search on each. In fact there are probably 100s of searches on each TWiki installation and from the download numbers there could be 1000s of TWikis out there. We have to think about our customers before we go and change things.

And you have seen some of the sharp reactions from customers both on the mailing list and on twiki.org when we change even smaller things.

Literal search is not the same as keyword search.

It is the "search string" where each keyword is separated by spaces. That is what Peter is talking about.

When I search for "hund" on a Danish language site I expect to find "hund", "hunden", "hunde", "hundene". The keyword search enables the use of + and -. If I search for "+hund -kat" I expect to find topics with "hunde" but not "katte". Many people will not see the current behaviour as a bug. But I do see the need for the type="word" with the same spec as keyword but looking only for whole words. And I would probably also change the generic search in top bar and the standard search topics to be "word" instead of "keyword".

And may I remind that this bug was discussed at a release meeting and the people present agreed that it was better to maintain compatibility and extend the feature with type="word". This is not a Peter and Kenneth idea.

KJL

I repeat (because it doesn't seem to hit): TWiki:Codev.KeywordSearchWithImplicitAnd was created because "TWiki's search should behave like Google and other modern search engines".

What a total misunderstanding of the concept of "keyword". With Google you don't use a "keyword" such as ria to find pages with variable. Regardless of what you expect, it has been implemented in the wrong manner and in that sense it is a real bug.

Please enlighten me on "literal search", because I couldn't find a definition for it. Wikipedia:String_literal only confuses me. But my intuition says literal search is "search for a phrase", as written elsewhere.

In the above example SearchTest I have added a "literal" search query, and it outputs exactly the same result as keyword search. I think that with your example of hund you could use literal search as well.

Note again that literal search is the default.

AC

-- ArthurClemens - 22 May 2006

Let's see the spec of SEARCH, as it's defined in TWikiVariables and SearchHelp. From the former, one can read:

 Do a [[http://dictionary.reference.com/search?q=keyword][keyword]] search like soap "web service" -shampoo

From the latter:

   * Specify word(s) you want to find
...
   * Example: To search for all topics that contain "SOAP", "WSDL", a literal "web service", but not "shampoo", enter this: soap +wsdl "web service" -shampoo
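To illustrate how the example query from SearchHelp breaks down into terms, here is a small tokenizer sketch. It is not TWiki's parser; the function parse_keywords and its return shape are hypothetical, and the handling of + (treated the same as no prefix) is an assumption based on the implicit AND.

```perl
use strict;
use warnings;

# Hypothetical tokenizer for a keyword search string: splits on spaces,
# keeps "quoted phrases" together, and sorts terms into include/exclude
# lists according to an optional leading + or -.
sub parse_keywords {
    my ($query) = @_;
    my ( @include, @exclude );
    while ( $query =~ /([+-]?)(?:"([^"]*)"|(\S+))/g ) {
        my $sign = $1;
        my $term = defined $2 ? $2 : $3;
        next unless length $term;
        if   ( $sign eq '-' ) { push @exclude, $term }
        else                  { push @include, $term }
    }
    return ( \@include, \@exclude );
}

my ( $include, $exclude ) = parse_keywords('soap +wsdl "web service" -shampoo');
print 'include: ', join( ' | ', @$include ), "\n";
print 'exclude: ', join( ' | ', @$exclude ), "\n";
```

Whether each resulting term is then matched as a substring or as a whole word is exactly the point under discussion; the parsing step is the same in both cases.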

From the user's point of view, it means that looking for a keyword is looking for a keyword, not looking for a string embedded inside another string (that is what regex search is for, as documented).

So we have two paths:

  • Remain "bug-compatible" so all the TWikiApps that were built taking advantage of a bug and work against the spec won't break.
  • Fix the behavior of keyword

The latter can be done in a compatible way. Instead of adding a new search type, we can add a "modifier" to the keyword search (boundary="word/none"?) so it either behaves like it behaves today or like it should behave. The default should be "how it should behave", but the upgrade document should state that there was a change with the shipped defaults, and that anyone with an installed app that depends on the old behavior should change the switch.

That means that I vote we fix keyword.

-- RafaelAlvarez - 22 May 2006

BTW, the same argument in favour of not fixing keyword can be used against changing ANY part of the Core, because some of the hundreds of plugins out there that work around the Func API (and so work against the spec) would break. And I mean both changes in behavior and structure.

-- RafaelAlvarez - 22 May 2006

Be careful not to assume knowing the single and only truth about what keyword means and making old TWiki topics Gospel. Please try to understand our customers need for being able to trust TWiki as a stable product.

The type="word" is just as fine a definition for searching for whole words as type="keyword". It extends the spec of SEARCH without breaking anything.

You cannot just claim that searching for keywords MUST mean only whole words.

Let me elaborate on this for a while.

In my language, Danish, nouns made from two or more nouns are put together as one word. "Battery Adapter" becomes "Batteriadapter" in Danish. "Stock market" is "aktiemarked" in Danish, made up from "aktie" and "marked". Normal users looking for the keyword aktie expect to find "aktiemarked", "aktiekurs", "toneangivende aktier" etc., and they do not expect to have to know anything about regular expressions to do that.

Literal search will not do it. It will not work for "+aktie -bank" to find pages with "aktie" but not bank.

The KeywordSearchWithImplicitAnd specifies the definition of the search string and does not contain any accurate definition of how it actually searches. To define searching for whole words actually extends much further than a string with white space around.

What about looking for "Lavrsen". Shouldn't it find my name when it is part of a KennethLavrsen?

What about "aktie-marked"? When I search for "aktie" on Google I get many hits with "aktie-marked" and "aktie-kurs". What about "(aktie)"? Or 'aktie' with the quotes? Or |aktie| in a table? Or when I search for Glostrup (my city) how about Glostruphallen? Or if I search for "Glostrup sport" shouldn't I find a topic with the text "sportsforretninger i Glostrup"? What in English is "a dog" and "the dog" is in Danish "en hund" and "hunden". No "the" word in front. We suffix the word with -en. We Danes often want to find "hunden" as well as "hund" when we search for "hund".

I can continue with "hund." "hund," "hund;", "hund2", "hund-", "politi-hund". What characters make up a word boundary and how does it work in an i18n environment? None of this was defined in KeywordSearchWithImplicitAnd because this is not what that topic was about.
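As a reference point for the boundary questions above, this sketch shows what a plain Perl \b test would do with some of those examples. It assumes ASCII text and uses \b as one possible definition of "word boundary"; the helper matches_word is hypothetical and this is not a statement about what TWiki does.

```perl
use strict;
use warnings;

# Whole-word test using Perl's \b, which sits between a \w character
# (letter, digit, underscore) and a non-\w character or the string edge.
sub matches_word {
    my ( $text, $word ) = @_;
    return $text =~ /\b\Q$word\E\b/i ? 1 : 0;
}

# [ search word, text ] pairs taken from the examples in this comment.
my @cases = (
    [ 'hund',     'hund.' ],
    [ 'hund',     'hund2' ],
    [ 'hund',     'politi-hund' ],
    [ 'hund',     'hunden' ],
    [ 'aktie',    '(aktie)' ],
    [ 'aktie',    '|aktie|' ],
    [ 'aktie',    'aktie-marked' ],
    [ 'Glostrup', 'Glostruphallen' ],
    [ 'Lavrsen',  'KennethLavrsen' ],
);

for my $case (@cases) {
    my ( $word, $text ) = @$case;
    printf "%-10s in %-16s %s\n", $word, $text,
      matches_word( $text, $word ) ? 'match' : 'no match';
}
```

Under this definition the punctuation, table-bar and hyphen cases match, while hund2, hunden, Glostruphallen and KennethLavrsen do not, because a letter or digit is directly adjacent to the search word. How non-ASCII letters are classified depends on how the text is decoded and on locale settings, so the i18n question stays open.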

And even if it was originally expected to be a full word search by some of the original debaters, whoever implemented the code found that the feature should be implemented differently. You cannot say the code is buggy, because it does find the words and it does not miss anything out or find something which is not there.

If the search code found things that are not there or missed words you could say that it is a bug.

You can argue for both ways of word searching to work. No one owns the truth. My truth is as valid as yours. You cannot come claiming that things that do not work according to your truth are a bug.

This is not a COS(X) function that returns a wrong value or a search that returns garbage characters or whatever shape a software bug can show itself in. This is a simple discussion about how a feature should work, and it is a feature which was released with the current behaviour 1.5 years ago and is used by 10000s of end-users. Some fraction of these users rely on the current implementation of the search feature and there is absolutely no good reason to start breaking anything for them when there is a very good alternative way to do it.

Why do things so often have to turn into a religious war? Why is compromise so difficult?

-- KennethLavrsen - 23 May 2006

It is not reasonable to have a keyword search degenerate into a literal search. A keyword search for net shouldn't turn up KennethLavrsen, because there is no way to claim that net is a meaningful part of KennethLavrsen. ria should not find variable or striated or any other word that those three letters happen to be part of.

Yes, there's room for compromise. There isn't room for completely ignoring what keyword signifies.

-- MeredithLesly - 23 May 2006

We are not completely ignoring the "keyword" case. We just implement it with type="word" and leave the "keyword" working like today. That is the compromise that should satisfy all parties.

-- KennethLavrsen - 23 May 2006

If a keyword search finds variable when searching for ria, then it's implemented wrong. Period.

-- MeredithLesly - 23 May 2006

It is KeywordSearchWithImplicitAnd itself where it is written "TWiki's search should behave like Google and other modern search engines". You can have your interpretation of searching word parts, but this was the goal and it was not met.

-- ArthurClemens - 23 May 2006

Kenneth, I think that you misunderstood me. My point is less about the true meaning of keyword, and more about what the documentation says vs what the code says and how users should behave. Let me explain:

If the documentation says A and the code does B, then either the documentation or the code is wrong. So, how do we decide which one is wrong?

From the user point of view, if the documentation says A and the system does B then the system is wrong and buggy. The user has the option to either raise a bug report or be silent and try to find a workaround.

We either make the latter users very, very aware that their workaround is likely to break in a future release, or we are complacent and thus promote "bugs" to "features".

The other case is when the user finds an "undocumented" feature. We should make very clear that undocumented features are not guaranteed to work in new releases (even if we promote and document some of those), or we run the risk of having to maintain a poorly-made choice.

-- RafaelAlvarez - 23 May 2006

To find a compromise, first we need to identify the opposing sides:

  • One Side: Keyword searches should maintain their current semantics so users won't be affected
  • Other Side: type="keyword" should be fixed. period. (screw the users?)

If these are the opposing sides, type="word" is not a compromise; it is imposing the One Side.

I did make a compromise. I proposed a change that "fixes" keyword behavior in a bug-compatible way. So One Side gets the current semantics and users won't be affected, and Other Side gets type="keyword" fixed.

And everybody is happy. (or so I hope)

-- RafaelAlvarez - 23 May 2006

Interestingly, in that topic, PTh comments "I assert that most applications use a regular expression search, so the chance is small that this spec change breaks existing content."

But, yes, Arthur, you are quite right and the spec was agreed to. It's unfortunate that apparently there's a serious bug in it. If there are installations relying on the bug, then they implemented things wrong.

As noted elsewhere, Google does not grok camelcase, which is unfortunate and, for all we know, may be remedied by them. I don't have a problem in extending the meaning of keyword to include complete portions of a wikiword, although I'm OK either way. It does seem a reasonable extension to allow searching for Lavrsen to turn up KennethLavrsen even if it isn't exactly what Google does. But, once again, pretending that keyword means literal and papering it over by adding a new kind of search isn't the right solution.
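One way the WikiWord extension mentioned above could work is to split WikiWords at lower-to-upper case transitions before applying the whole-word test. This is purely a sketch of the idea: both function names are hypothetical, the splitting rule is an assumption, and nothing here is taken from TWiki's code.

```perl
use strict;
use warnings;

# Insert a space at each lowercase/digit -> uppercase transition,
# so that "KennethLavrsen" becomes "Kenneth Lavrsen".
sub expand_wikiwords {
    my ($text) = @_;
    ( my $expanded = $text ) =~ s/([a-z0-9])([A-Z])/$1 $2/g;
    return $expanded;
}

# Whole-word match against the text as written, or against the text
# with WikiWords split into their parts.
sub matches_word_or_wikiword_part {
    my ( $text, $word ) = @_;
    my $pattern = qr/\b\Q$word\E\b/i;
    return ( $text =~ $pattern || expand_wikiwords($text) =~ $pattern ) ? 1 : 0;
}

for my $word (qw(Lavrsen net)) {
    printf "%-8s in KennethLavrsen: %d\n", $word,
      matches_word_or_wikiword_part( 'KennethLavrsen', $word );
}
```

With this rule Lavrsen is found in KennethLavrsen and pro in MacPro, while net and ria still do not match inside KennethLavrsen or variable.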

-- MeredithLesly - 23 May 2006

Kenneth: out of curiosity, where do you get "TWiki has been installed at least 1000 places"?

-- MeredithLesly - 23 May 2006

Googling for aktie with Danish as the language turns up aktie-blodbad, aktie-dyk, solenergi-aktie, etc. That is, it finds words, where a word is defined as something surrounded by white space, where the match is identical or a part of the word that's surrounded by whitespace or punctuation, most notably dashes. If TWiki does not find "parts of words" in this manner, that is a bug according to the documentation. It shouldn't, however, find aktieanalyser or Aktieinfo or aktiema. (I mention these because they're in the third hit I get, and they're not in bold.)

If people have created keyword searches that rely on behaving differently than Google's behaviour, then they wrote broken searches even if they happened to work due to a bug. If TWiki is to be considered reliable, it must fix bugs when they're found, and not leave them in because someone is relying on buggy behaviour.

-- MeredithLesly - 24 May 2006

Let's cool down. We agree that we disagree on the meaning of "keyword". It's clear that the code doesn't do what the documentation says it should. And there is the patent possibility that there are TWikiApps out there relying on the bug.

Let's stop arguing and look at the possibilities/proposals we have, to decide what we are going to do.

  • Fix keyword to work as documented
    • (Cons) May break some TWikiApps.
    • (Pro) Works as expected from a traditional keyword search.
  • Fix the documentation so it describes the current behavior
    • (Pro) Some TWikiApps will still work.
    • (Cons) Won't follow the "principle of least astonishment" for users accustomed to traditional keyword search.
  • Add type="word" to the options, to perform the "keyword search" as documented, keep the current behavior of type="keyword"
    • (Pro) TWikiApps will still work
    • (Pro) Both embedded and whole words search will be easily performed
    • (Cons) Won't follow the "principle of least astonishment" for users accustomed to traditional keyword search.
    • (Cons) At first sight, there is no patent difference between type="keyword" and type="word", because type="keyword" doesn't convey the traditional meaning.
    • (Cons) Add more options to the already complex %SEARCH% tag
  • Fix type="keyword" to work by the spec, add a new option (boundary?) to choose between the "keyword search" as documented or the current behavior. The default for this parameter should be configurable.
    • (Pro) TWikiApps will still work
    • (Pro) Both embedded and whole words search will be easily performed
    • (Cons) Add more options to the already complex %SEARCH% tag
    • (Cons) Adds more settings to the already huge set of TWiki settings
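The last proposal (a boundary option with a configurable default) could be resolved per search roughly as follows. This is a sketch under assumptions: the setting name KeywordBoundaryDefault and the function use_word_boundaries are invented for the example and are not existing TWiki configuration keys or APIs.

```perl
use strict;
use warnings;

# Hypothetical site-level setting; the name is an assumption, not a TWiki key.
my %site_cfg = ( KeywordBoundaryDefault => 'word' );

# Resolve the effective boundary mode for one SEARCH call.
# An explicit boundary="word" or boundary="none" wins over the site default.
sub use_word_boundaries {
    my ( $params, $cfg ) = @_;
    my $boundary =
      defined $params->{boundary}
      ? $params->{boundary}
      : $cfg->{KeywordBoundaryDefault};
    die "Unknown boundary mode '$boundary'\n"
      unless $boundary eq 'word' || $boundary eq 'none';
    return $boundary eq 'word' ? 1 : 0;
}

print use_word_boundaries( {}, \%site_cfg ), "\n";                        # site default applies
print use_word_boundaries( { boundary => 'none' }, \%site_cfg ), "\n";    # app keeps the old behaviour
```

A site that depends on the old behaviour would set the default to none once, while individual searches could still opt in or out explicitly.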

Disclaimer: I'm assuming that the traditional meaning for keyword is the one found in the dictionary, which I assume is the one most people are used to.

-- RafaelAlvarez - 24 May 2006

Unless we change the documentation, we should be using the Google definition of keyword. This is, after all, the best way of determining what people expect, since Google is widely used.

That's why I experimented with Google, using one of Kenneth's example searches. In particular, a search should turn up a match in a hyphenated "word", as described above, not just words surrounded by whitespace. (Whitespace is wrong, anyway, because of punctuation; I assume that it was never intended to use only white space to find keywords.)

If it doesn't add a lot of hair to the search code, then it seems the best solution would be to fix the keyword search behaviour and add a global configuration option that would cause keyword search to behave the way it does now. That avoids further complicating SEARCH while allowing sites that have used keyword search in the broken way to continue working. We should, however, also provide documentation on how to rewrite searches to conform to the spec.

-- MeredithLesly - 24 May 2006

It is clear this discussion is in a status quo. But we can tackle this in a practical way (in the line of Meredith's last comment) - see #Specification above.

Shouldn't SEARCHDEFAULTTTYPE be in configure?

-- ArthurClemens - 21 Apr 2007

Yes, adding type="word" to search is a good solution.

I think SEARCHDEFAULTTTYPE should remain in TWikiPreferences since it might get redefined at a lower level.

-- PeterThoeny - 22 Apr 2007

Good to see this being worked on again and with a new spec.

Since this is an old rejected proposal the 14-day rule should not apply. When new spec is agreed please summarize and put forward for release meeting.

-- KennethLavrsen - 23 Apr 2007

I've added #Implementation in progress at the top.

-- ArthurClemens - 23 Apr 2007

One detail that needs to be investigated: The Jump box feature to drill down to a topic by typing part of the name would no longer work if type="word" is enabled at a site. That means we need to explicitly specify type="keyword" in a number of queries, just to guard against cases where an admin changes the default from type="keyword" to type="word".

-- PeterThoeny - 24 Apr 2007

I have finished the implementation.

-- ArthurClemens - 02 May 2007

Great, Arthur!

I think the doc is pending in the description of TWiki.TWikiPreferences's SEARCHDEFAULTTTYPE.

-- PeterThoeny - 08 May 2007

Topic revision: r27 - 2007-05-08 - PeterThoeny
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.