Keyword Search
KeywordSearchWithImplicitAnd has been implemented and released with Cairo (TWiki 3). It now appears that the notion of "keyword" has led to confusion and possibly to a wrong implementation. This topic seeks to list possible solutions.
The origin of the bug report was the observation that with keyword search, TWiki would find a page containing the word variable when searching for the keyword RIA.
Originally, KeywordSearchWithImplicitAnd was created because "TWiki's search should behave like Google and other modern search engines".
The current implementation appears to be:
- Enter keywords (= separate words)
- Find words that contain these keywords
- For example: ria finds variable
Whereas Google's implementation is:
- Enter keywords (= separate words)
- Find words that are these keywords
- For example: ria finds ria and RIA but not variable
- Finds expanded acronyms. For example: ria finds Rock Island Arsenal, Rich Internet Applications, and more
- Finds full-word parts of hyphenated phrases. For example: pro finds pro-choice (and Public Record Office, for that matter)
- Finds wikiwords when all parts are given. For example: Mac pro finds MacPro.
--
Contributors: ArthurClemens, MeredithLesly
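The difference between the two behaviours listed above can be sketched in a few lines. This is only an illustration in Python, not TWiki's Perl code: the function names are hypothetical, and case-insensitive matching is assumed because the examples treat ria and RIA alike. Acronym expansion and wikiword splitting are not covered.

```python
import re

def contains_match(keyword, text):
    """Behaviour described for the current implementation:
    the keyword may occur anywhere inside a word."""
    return keyword.lower() in text.lower()

def whole_word_match(keyword, text):
    """Behaviour described for Google: the keyword must be
    delimited by word boundaries on both sides."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

if __name__ == "__main__":
    for text in ["variable", "RIA", "pro-choice"]:
        for keyword in ["ria", "pro"]:
            print(keyword, text,
                  contains_match(keyword, text),
                  whole_word_match(keyword, text))
```

With these definitions, ria is found in variable only by the substring variant, while pro is found in pro-choice by both, because the hyphen counts as a word boundary.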
Specification
- (Peter's suggestion) Add type="word" to SEARCH
- Document this option in TWiki.TWikiPreferences at SEARCHDEFAULTTTYPE
- Add documentation to TWiki.VarSEARCH
Following the implementation in progress (see below):
- type="word" really means: type="keyword" while using word boundaries
- when type="word" is passed:
  - internally (in Search.pm):
    - change type from word to keyword
    - set (new) wordboundaries to 1
More details under Implementation.
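The mapping in this specification can be sketched as a small normalization step. This is a Python illustration under stated assumptions, not the code in Search.pm: the function name and the use of a plain dictionary for the options are hypothetical; only the keys type and wordboundaries come from the text above.

```python
def normalize_search_options(options):
    """Return a copy of the options in which type="word" is rewritten
    to type="keyword" plus wordboundaries=1 (hypothetical helper)."""
    normalized = dict(options)  # leave the caller's options untouched
    if normalized.get("type") == "word":
        normalized["type"] = "keyword"
        normalized["wordboundaries"] = 1
    else:
        # Every other search type keeps its current behaviour.
        normalized.setdefault("wordboundaries", 0)
    return normalized

if __name__ == "__main__":
    print(normalize_search_options({"type": "word", "search": "ria"}))
```

The point of doing it this way is that the search back ends only ever see keyword, regex or literal, plus one extra flag.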
Implementation
- The option wordboundaries is passed to the search implementations (Store::searchInWebContent leads to a call to search on the chosen algorithm {SearchAlgorithm})
- The search implementations can be used for either keyword or word
- When using scope="topic", the search type is not actually used, because a Perl search on topic names is done. This is good for WebSearch, because no extra logic has to be built in to distinguish between searching on title and searching on text.
- In Forking.pm, a regex search must be used via EgrepCmd:
if ( $options->{type} eq 'regex' || $options->{wordboundaries} == 1 ) { $program = $TWiki::cfg{RCS}{EgrepCmd}; }
- A few lines below, add the word boundary check to the search string (currently one word at a time is searched, but will that always be the case? Update: no, quoted search strings are searched as-is, but that does not change the word boundaries):
$searchString =~ s/^(.*)$/\\b$1\\b/go if $options->{'wordboundaries'};
- The same line goes into PurePerl.pm and NativeSearch.pm
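The two changes described for Forking.pm can be mirrored in a short Python sketch to show their effect. Assumptions: the strings "egrep" and "fgrep" merely stand in for the configured grep commands (the source only names EgrepCmd), the helper names are hypothetical, and the search string is assumed to be already escaped for regex use, as the Perl substitution does no escaping either.

```python
import re

def choose_grep_program(options):
    """An extended-regex grep is needed for type="regex" and also
    whenever word boundaries are requested."""
    needs_regex = (options.get("type") == "regex"
                   or options.get("wordboundaries") == 1)
    return "egrep" if needs_regex else "fgrep"

def add_word_boundaries(search_string, options):
    """Wrap the whole search string in word-boundary anchors,
    like the Perl substitution in the list above."""
    if options.get("wordboundaries"):
        return r"\b" + search_string + r"\b"
    return search_string

if __name__ == "__main__":
    opts = {"type": "keyword", "wordboundaries": 1}
    for term in ["ria", "web service"]:
        print(choose_grep_program(opts), add_word_boundaries(term, opts))
```

A quoted phrase such as web service is wrapped as a whole, so the boundaries apply to the start of web and the end of service, which is why quoted strings do not need special handling.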
Bug entry: Bugs:Item2123
--
ArthurClemens - 02 May 2007
Discussion
In my topic I have defined a search for RIA. But the search also finds topics with variable. This shouldn't happen with type=keyword.
AC
That is a question of how to define "keyword". For TWiki it is any set of number of characters, separated by space. Such a run of characters can be anywhere in a word, not just at word boundaries.
This is an enhancement IMHO. Could be done with a new type=word to keep it compatible.
Renamed summary from "Search finds topics it should not find" to "Add new type=word search to search on word boundaries"
--
PTh
The Rest Of The World defines "keywords" as "words", not as word parts or a collection of characters. If TWiki searches anywhere in a word, this is a bug.
I always thought "literal search" was for finding a series of characters.
Changing priority back to normal (bug).
AC
It was agreed at the release meeting May 2nd that this is an enhancement and that your proposed change will be non-backwards compatible and break many twikiapps if implemented as you suggest. We cannot just change the way search works. It WILL break twikiapps!
Peter's change was agreed to be a backwards compatible solution for those who want to make a twiki app that only finds words.
Changing back to enhancement
KJL
That's a shame really. So the first implementation of "keyword" search was wrong. We can create a new word type search, but this will increase confusion. The documentation should at least be clear on this.
AC
TWiki:Codev.KeywordSearchWithImplicitAnd says about keyword search:
Add a new "keyword" search type besides the existing "literal" search and "regex" search. Users expect keyword search, e.g. TWiki's search should behave like Google and other modern search engines
Words and literal text are delimited by space.
This seems very clear. Changed status back to bug.
AC
And I change it back to enhancement.
Since "the old romans" TWiki's search feature has worked the way it works today.
There are 10000s of TWiki Applications out there using search. Search is one of the most commonly used features. If we change the search so it suddenly looks for whole words only instead of parts of words, it will break quite a few of those TWiki Applications. We are not talking about a bug here where something never worked for anyone and we are fixing it. Searching for strings that can be part of a word is a fully valid feature and people, including myself, have been using it for years. Searching for whole words is a missing feature that should be added.
If we want to be able to search for whole words only, we have to enhance the search feature with a new keyword. Not change it.
We will screw up TWiki Applications for our current users if we change the behaviour.
Naturally the documentation needs to be clear on all of this. It is sad that the current keywords may be confusing, but fixing it breaks things. And I must admit that keyword could be misunderstood. But what word is better that is a common word and not a programmer nerd word? English does not have enough words to always be able to specifically describe something in ONE word. But if we add a type="word" and both type="keyword" and type="word" are documented properly, I think we can contain the confusion to a reasonable level. People need to look at the SEARCH doc anyway to know what to add in a SEARCH tag.
I do not want to start a Priority field war. I just want to make sure no one "fixes" the current behaviour when seeing this bug report, but properly understands that it is adding an additional feature to the search to maintain backwards compatibility. Requirement does not mean low priority.
KJL
This 'feature' has been introduced in Cairo, and it has been implemented differently from the documentation. 10000s of search applications seems to be a very high estimate.
So we now have "keyword search" and "literal search" that both do the same thing. Or?
Peter writes about keyword: "For TWiki it is any set of number of characters, separated by space." That definition seems ok, as long as it means separated by space on both sides. But clearly TWiki keyword search doesn't work this way, because it finds variable when searching for RIA.
Shouldn't this be repaired?
If people are using "keyword search" when searching for word parts, they are not following the search docs. In fact, if they followed the docs, they would use "literal search", not keyword search.
The default search type is "literal search" (defined in %SEARCHVARDEFAULTTYPE%). So people have deliberately set a search to keyword search. And for these people, their apps will be very easy to update.
AC
TWiki has been installed at least 1000 places. There are at least 10 topics with a search on each. In fact there are probably 100s of searches on each TWiki installation and from the download numbers there could be 1000s of TWikis out there. We have to think about our customers before we go and change things.
And you have seen some of the sharp reactions from customers both on the mailing list and on twiki.org when we change even smaller things.
Literal search is not the same as keyword search.
It is the "search string" where each keyword is separated by spaces. That is what Peter is talking about.
When I search for "hund" on a Danish language site I expect to find "hund", "hunden", "hunde", "hundene". The keyword search enables the use of + and -. If I search for "+hund -kat" I expect to find topics with "hunde" but not "katte". Many people will not see the current behaviour as a bug. But I do see the need for type="word" with the same spec as keyword but looking only for whole words. And I would probably also change the generic search in the top bar and the standard search topics to be "word" instead of "keyword".
And may I remind you that this bug was discussed at a release meeting and the people present agreed that it was better to maintain compatibility and extend the feature with type="word". This is not a Peter and Kenneth idea.
KJL
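The "+hund -kat" example in the comment above can be made concrete with a small sketch of a keyword query with implicit AND, + and -, and quoted phrases. It is written in Python purely for illustration and is not TWiki's parser: the function names and the whole_words switch are hypothetical, and case-insensitive matching is assumed. The switch shows how the same query yields different results under the two readings of "keyword".

```python
import re

def parse_query(query):
    """Split a query into (required, excluded) terms.
    Quoted phrases are kept together; a leading + is optional."""
    required, excluded = [], []
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        term = phrase or word
        if not term:
            continue
        (excluded if sign == "-" else required).append(term)
    return required, excluded

def term_found(term, text, whole_words):
    pattern = re.escape(term)
    if whole_words:
        pattern = r"\b" + pattern + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def topic_matches(query, text, whole_words=False):
    """All required terms must be present and no excluded term may be."""
    required, excluded = parse_query(query)
    return (all(term_found(t, text, whole_words) for t in required)
            and not any(term_found(t, text, whole_words) for t in excluded))

if __name__ == "__main__":
    text = "Mine hunde leger"
    print(topic_matches("+hund -kat", text))                    # substring reading
    print(topic_matches("+hund -kat", text, whole_words=True))  # whole-word reading
```

Under the substring reading the topic is a hit because hunde contains hund; under the whole-word reading it is not, which is exactly the disagreement in this thread.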
I repeat (because it doesn't seem to hit):
TWiki:Codev.KeywordSearchWithImplicitAnd was created because "TWiki's search should behave like Google and other modern search engines".
What a total misunderstanding of the concept of "keyword". With Google you don't use a "keyword" such as ria to find pages with variable. Regardless of what you expect, it has been implemented in the wrong manner and in that sense it is a real bug.
Please enlighten me on "literal search", because I couldn't find a definition for it. Wikipedia:String_literal only confuses me. But my intuition says literal search is "search for a phrase", as written elsewhere.
In the above example SearchTest I have added a "literal" search query, and it outputs exactly the same result as keyword search. I think that with your example of hund you could use literal search as well.
Note again that literal search is the default.
AC
--
ArthurClemens - 22 May 2006
Let's see the spec of SEARCH, as it's defined in TWikiVariables and SearchHelp.
From the former, one can read:
Do a [[http://dictionary.reference.com/search?q=keyword][keyword]] search like soap "web service" -shampoo
From the latter:
* Specify word(s) you want to find
...
* Example: To search for all topics that contain "SOAP", "WSDL", a literal "web service", but not "shampoo",
enter this: soap +wsdl "web service" -shampoo
From the user's point of view, it means that looking for a keyword is looking for a key word, not looking for a string embedded inside another string (that's what regexp search is for, as documented).
So we have two paths:
- Remain "bug-compatible" so all the TWikiApps that were built taking advantage of a bug and work against the spec won't break.
- Fix the behavior of keyword
The latter can be done in a compatible way. Instead of adding a new search type, we can add a "modifier" to the keyword search (boundary="word/none"?) so it either behaves like it behaves today or like it should behave. The default should be "how it should behave", but the upgrade document should state that there was a change in the shipped defaults, and that if an installed app depends on the old behavior, they should change the switch.
That means that I vote we fix keyword.
--
RafaelAlvarez - 22 May 2006
BTW, the same argument in favour of not fixing keyword can be used against changing ANY part of the Core, because some of the 100s of plugins out there that work around the Func API (so working against the spec) might break. And I mean both changes in behavior and structure.
--
RafaelAlvarez - 22 May 2006
Be careful not to assume you know the one and only truth about what keyword means, or to make old TWiki topics Gospel. Please try to understand our customers' need for being able to trust TWiki as a stable product.
The type="word" is just as fine a definition for searching for whole words as type="keyword". It extends the spec of SEARCH without breaking anything.
You cannot just claim that searching for keywords MUST mean only whole words.
Let me elaborate on this for a while.
In my language, Danish, nouns made from two or more nouns are put together as one word. "Battery Adapter" becomes "Batteriadapter" in Danish. "Stock market" is "aktiemarked" in Danish, made up from "aktie" and "marked". Normal users looking for the keyword aktie expect to find "aktiemarked", "aktiekurs", "toneangivende aktier" etc., and they do not expect to have to know anything about regular expressions to do that.
Literal search will not do it. It will not work for "+aktie -bank" to find pages with "aktie" but not bank.
The KeywordSearchWithImplicitAnd topic specifies the definition of the search string and does not contain any accurate definition of how it actually searches. Defining a search for whole words actually extends much further than a string with white space around it.
What about looking for "Lavrsen"? Shouldn't it find my name when it is part of KennethLavrsen?
What about "aktie-marked"? When I search for "aktie" on Google I get many hits with "aktie-marked" and "aktie-kurs". What about "(aktie)"? Or 'aktie' with the quotes? Or |aktie| in a table? Or when I search for Glostrup (my city) how about Glostruphallen? Or if I search for "Glostrup sport" shouldn't I find a topic with the text "sportsforretninger i Glostrup"? What in English is "a dog" and "the dog" is in Danish "en hund" and "hunden". No "the" word in front. We suffix the word with -en. We Danes often want to find "hunden" as well as "hund" when we search for "hund".
I can continue with "hund." "hund," "hund;", "hund2", "hund-", "politi-hund". What characters make up a word boundary and how does it work in an i18n environment? None of this was defined in KeywordSearchWithImplicitAnd because this is not what that topic was about.
And even if it was originally expected to be a full word search by some of the original debaters whoever implemented the code found that the features should be implemented differently. You cannot say the code is buggy because it does find the words and it does not miss anything out or find something which is not there.
If the search code found things that are not there or missed words you could say that it is a bug.
You can argue for both ways of word searching to work. No one owns the truth. My truth is as valid as yours. You cannot come claiming that things that do not work according to your truth are a bug.
This is not a COS(X) function that returns a wrong value or a search that returns garbage characters or whatever shape a software bug can show itself in. This is a simple discussion about how a feature should work, and it is a feature which was released with the current behaviour 1.5 years ago and is used by 10000s of end-users. Some fraction of these users rely on the current implementation of the search feature and there is absolutely no good reason to start breaking anything for them when there is a very good alternative way to do it.
Why do things so often have to turn into a religious war? Why is compromise so difficult?
--
KennethLavrsen - 23 May 2006
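The boundary cases listed in the comment above can be checked against one concrete definition of "word boundary", the regex anchor \b. The sketch below uses Python's re module only as an illustration; the helper name is hypothetical, and how Perl or egrep treat non-ASCII letters depends on locale and encoding settings that are not shown here. The word søløve is an example added for this sketch, not taken from the discussion.

```python
import re

def whole_word_hit(word, text, flags=0):
    """True if word occurs in text with a regex word boundary on both sides."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE | flags) is not None

# Punctuation, hyphens, quotes and table bars all act as boundaries.
punctuated = ["hund.", "hund,", "hund;", "hund-", "politi-hund"]
bracketed = ["(aktie)", "'aktie'", "|aktie|"]

# Digits and letters do not: suffixed and compound forms are not found.
glued = [("hund", "hund2"), ("hund", "hunden"),
         ("aktie", "aktiemarked"), ("Lavrsen", "KennethLavrsen")]

if __name__ == "__main__":
    print([whole_word_hit("hund", s) for s in punctuated])
    print([whole_word_hit("aktie", s) for s in bracketed])
    print([whole_word_hit(w, s) for w, s in glued])
    # i18n: with Unicode-aware matching the letter before "løve" is a word
    # character; with ASCII-only matching it is treated as a boundary.
    print(whole_word_hit("løve", "søløve"),
          whole_word_hit("løve", "søløve", re.ASCII))
```

Under this definition the punctuation cases all match, while hunden, aktiemarked and KennethLavrsen do not, so Danish compounds and wikiword parts would need handling beyond plain word boundaries. The last line shows that the answer for non-ASCII text changes with the character-class settings, which is the i18n question raised above.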
It is not reasonable to have a keyword search degenerate into a literal search. A keyword search for net shouldn't turn up KennethLavrsen, because there is no way to claim that net is a meaningful part of KennethLavrsen. ria should not find variable or striated or any other word that those three letters happen to be part of.
Yes, there's room for compromise. There isn't room for completely ignoring what keyword signifies.
--
MeredithLesly - 23 May 2006
We are not completely ignoring the "keyword" case. We just implement it with type="word" and leave the "keyword" working like today. That is the compromise that should satisfy all parties.
--
KennethLavrsen - 23 May 2006
If a keyword search finds variable when searching for ria, then it's implemented wrong. Period.
--
MeredithLesly - 23 May 2006
It is in KeywordSearchWithImplicitAnd itself that it is written "TWiki's search should behave like Google and other modern search engines". You can have your interpretation of searching word parts, but this was the goal and it was not met.
--
ArthurClemens - 23 May 2006
Kenneth, I think that you misunderstood me. My point is less about the true meaning of keyword, and more about what the documentation says vs what the code says and how users should behave. Let me explain:
If the documentation says A and the code does B, then either the documentation or the code is wrong. So, how do we decide which one is wrong?
From the user's point of view, if the documentation says A and the system does B, then the system is wrong and buggy. The user has the option to either raise a bug report or be silent and try to find a workaround.
We either make the latter users very, very aware that their workaround is likely to break in a future release, or we are complacent and thus promote "bugs" to "features".
The other case is when the user finds an "undocumented" feature. We should make very clear that undocumented features are not guaranteed to work in new releases (even if we promote and document some of those), or we run the risk of having to maintain a poorly-made choice.
--
RafaelAlvarez - 23 May 2006
To find a compromise, we first need to identify the opposing sides:
- One Side: Keyword searches should maintain their current semantics so users won't be affected
- Other Side: type="keyword" should be fixed. period. (screw the users?)
If these are the opposing sides, type="word" is not a compromise; it is imposing the One Side.
I did make a compromise. I proposed a change that "fixes" keyword behavior in a bug-compatible way.
So One Side gets the current semantics and users won't be affected, and Other Side gets type="keyword" fixed.
And everybody is happy. (or so I hope)
--
RafaelAlvarez - 23 May 2006
Interestingly, in that topic, PTh comments "I assert that most applications use a regular expression search, so the chance is small that this spec change breaks existing content."
But, yes, Arthur, you are quite right and the spec was agreed to. It's unfortunate that apparently there's a serious bug in it. If there are installations relying on the bug, then they implemented things wrong.
As noted elsewhere, Google does not grok camelcase, which is unfortunate and, for all we know, may be remedied by them. I don't have a problem with extending the meaning of keyword to include complete portions of a wikiword, although I'm OK either way. It does seem a reasonable extension to allow searching for Lavrsen to turn up KennethLavrsen even if it isn't exactly what Google does. But, once again, pretending that keyword means literal and papering it over by adding a new kind of search isn't the right solution.
--
MeredithLesly - 23 May 2006
Kenneth: out of curiosity, where do you get "TWiki has been installed at least 1000 places"?
--
MeredithLesly - 23 May 2006
Googling for aktie with Danish as the language turns up aktie-blodbad, aktie-dyk, solenergi-aktie, etc. That is, it finds words, where a word is defined as something surrounded by white space, where the match is identical or a part of the word that's surrounded by whitespace or punctuation, most notably dashes. If TWiki does not find "parts of words" in this manner, that is a bug according to the documentation. It shouldn't, however, find aktieanalyser or Aktieinfo or aktiema. (I mention these because they're in the third hit I get, and they're not in bold.)
If people have created keyword searches that rely on behaving differently from Google's behaviour, then they wrote broken searches even if they happened to work due to a bug. If TWiki is to be considered reliable, it must fix bugs when they're found, and not leave them in because someone is relying on buggy behaviour.
--
MeredithLesly - 24 May 2006
Let's cool down. We agree that we disagree on the meaning of "keyword". It's clear that the code doesn't do what the documentation says it should. And there is the patent possibility that there are TWikiApps out there relying on the bug.
Let's stop arguing and look at the possibilities/proposals we have, to decide what we are going to do.
- Fix keyword to work as documented
  - (Cons) May break some TWikiApps.
  - (Pro) Works as expected from a traditional keyword search.
- Fix the documentation so it describes the current behavior
  - (Pro) Some TWikiApps will still work.
  - (Cons) Won't follow the "principle of least astonishment" for users accustomed to traditional keyword search.
- Add type="word" to the options, to perform the "keyword search" as documented, and keep the current behavior of type="keyword"
  - (Pro) TWikiApps will still work
  - (Pro) Both embedded and whole words search will be easily performed
  - (Cons) Won't follow the "principle of least astonishment" for users accustomed to traditional keyword search.
  - (Cons) At first sight, there is no patent difference between type="keyword" and type="word", because type="keyword" doesn't convey the traditional meaning.
  - (Cons) Adds more options to the already complex %SEARCH% tag
- Fix type="keyword" to work by the spec, and add a new option (boundary?) to choose between the "keyword search" as documented and the current behavior. The default for this parameter should be configurable.
  - (Pro) TWikiApps will still work
  - (Pro) Both embedded and whole words search will be easily performed
  - (Cons) Adds more options to the already complex %SEARCH% tag
  - (Cons) Adds more settings to the already huge set of TWiki settings
Disclaimer: I'm assuming that the traditional meaning of keyword is the one found in the dictionary, which I assume is the one most people are used to.
--
RafaelAlvarez - 24 May 2006
Unless we change the documentation, we should be using the Google definition of keyword. This is, after all, the best way of determining what people expect, since Google is widely used.
That's why I experimented with Google, using one of Kenneth's example searches. In particular, a search should turn up a match in a hyphenated "word", as described above, not just words surrounded by whitespace. (Whitespace is wrong, anyway, because of punctuation; I assume that it was never intended to use only white space to find keywords.)
If it doesn't add a lot of hair to the search code, then it seems the best solution would be to fix the keyword search behaviour and add a global configuration option that would cause keyword search to behave the way it does now. That avoids further complicating SEARCH while allowing sites that have used keyword search in the broken way to continue working. We should, however, also provide documentation on how to rewrite searches to conform to the spec.
--
MeredithLesly - 24 May 2006
It is clear this discussion has reached an impasse. But we can tackle this in a practical way (in line with Meredith's last comment) - see #Specification above.
Shouldn't SEARCHDEFAULTTTYPE be in configure?
--
ArthurClemens - 21 Apr 2007
Yes, adding type="word" to search is a good solution.
I think SEARCHDEFAULTTTYPE should remain in TWikiPreferences since it might get redefined at a lower level.
--
PeterThoeny - 22 Apr 2007
Good to see this being worked on again and with a new spec.
Since this is an old rejected proposal, the 14-day rule should not apply. When the new spec is agreed, please summarize it and put it forward for the release meeting.
--
KennethLavrsen - 23 Apr 2007
I've added #Implementation in progress at the top.
--
ArthurClemens - 23 Apr 2007
One detail that needs to be investigated: the Jump box feature to drill down to a topic by typing part of the name would no longer work if type="word" is enabled at a site. That means we need to explicitly specify type="keyword" in a number of queries, just to guard against cases where an admin changes the default from type="keyword" to type="word".
--
PeterThoeny - 24 Apr 2007
I have finished the implementation.
--
ArthurClemens - 02 May 2007
Great, Arthur!
I think the documentation of SEARCHDEFAULTTTYPE in TWiki.TWikiPreferences is still pending.
--
PeterThoeny - 08 May 2007