There seems to be much discussion of search issues but no index / gateway topic covering all of them, which is perhaps why the duplication is occurring. So I have created the following table.
I think some of the topics on search functionality need refactoring. There seems to be much duplication of effort, and combining of issues. For example: KeywordSearchWithImplicitAnd covers two issues, one of which also appears to be discussed in SearchSuggestion, SearchIsBroken, and SearchEnhancementsWithAndOr.
--
SamHasler - 10 Feb 2003
Number of topics: 236
Search enhancements
- Replace the grep search with a search engine.
--
PeterThoeny - 06 Sep 1999
A good search engine is something that every site needs. Dynamic sites like TWiki-powered ones cannot rely on external search engines, both for speed and because the results would not be up to date. All the dynamic sites I know of, including TWiki, use some kind of home-grown search engine. Surely out of all this work, someone, somewhere, has already built this wheel in a portable format.
Is there really no Perl::AwesomeGooglifyYourOwnSite module available that TWiki could use? All of the problems mentioned in the above categories have already been solved. Why does TWiki need to solve them again?
I am not a programmer; maybe I just don't understand the implementation issues and this is really a Google:Wicked-Problem, but I really think more stuff could be leveraged from other projects.
--
MattWilkie - 10 Feb 2003
From the table above the topics related to search engines and logic in searches are:
I'm brainstorming some ideas for giving searches extra functionality so that they return more search-engine-like results. For a search such as "blue green", I think people new to TWiki expect pages containing "blue AND green"; that is, they assume an implicit AND (KeywordSearchWithImplicitAnd).
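As a rough illustration of the implicit AND idea (not TWiki's actual implementation, which is Perl and grep based), here is a minimal Python sketch. The function names and the in-memory `topics` dictionary are assumptions made for the example; matching is a plain case-insensitive substring test, so "blue" would also match "bluebird".

```python
def matches_all_terms(query, text):
    """Return True if every whitespace-separated term of query occurs in text."""
    terms = query.lower().split()
    if not terms:
        return False  # an empty query matches nothing
    haystack = text.lower()
    return all(term in haystack for term in terms)


def implicit_and_search(query, topics):
    """Return the sorted names of topics whose text contains all query terms.

    topics is a hypothetical mapping of topic name -> topic text.
    """
    return sorted(name for name, text in topics.items()
                  if matches_all_terms(query, text))


if __name__ == "__main__":
    topics = {
        "ColorNotes": "blue and green paint",
        "SkyNotes": "the sky is blue",
    }
    print(implicit_and_search("blue green", topics))
```

With the sample data, only ColorNotes contains both terms, so it is the only topic returned for "blue green".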
I think people would also expect pages to be ordered by relevance, i.e. by how many times the search term appears in the topic (SearchOrderByRelevance). The basis of it might be using grep -c, which counts the lines in a file that match a pattern (a close proxy for the number of occurrences).
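To make the relevance ordering concrete, here is a small Python sketch that counts occurrences directly rather than shelling out to grep. It is only an illustration: the names `rank_by_relevance` and `topics` are invented for the example, and it counts every occurrence of a term, whereas grep -c counts matching lines.

```python
def count_occurrences(term, text):
    """Count case-insensitive, non-overlapping occurrences of term in text."""
    return text.lower().count(term.lower())


def rank_by_relevance(query, topics):
    """Return topic names ordered by total term occurrences, highest first.

    topics is a hypothetical mapping of topic name -> topic text.
    Topics with no occurrences are left out; ties are broken by name.
    """
    terms = query.split()
    scored = []
    for name, text in topics.items():
        score = sum(count_occurrences(term, text) for term in terms)
        if score > 0:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


if __name__ == "__main__":
    topics = {"A": "blue blue green", "B": "blue", "C": "red"}
    print(rank_by_relevance("blue green", topics))
```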
I'm also thinking about a kind of TWiki page rank (SearchWithTopicRank) that scores pages based on how many references there are to them, or how many children they have. If this has to be done on the fly, though, it may be too slow to be useful.
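One way to picture the reference-counting half of this idea is the Python sketch below. It is an assumption-laden illustration: `topic_rank` and the in-memory `topics` mapping are hypothetical, a "reference" is simply another topic's name appearing as a word in the text, and child counts are not covered. Computing the ranks once and reusing them, rather than on every search, might be one way around the speed concern.

```python
import re


def topic_rank(topics):
    """Count, for each topic, how many other topics mention its name.

    topics is a hypothetical mapping of topic name -> topic text.
    Self-references are ignored and each referring topic counts once.
    """
    names = set(topics)
    rank = {name: 0 for name in names}
    for source, text in topics.items():
        # Words in this topic's text that are names of known topics.
        referenced = set(re.findall(r"[A-Za-z0-9]+", text)) & names
        referenced.discard(source)
        for target in referenced:
            rank[target] += 1
    return rank


if __name__ == "__main__":
    topics = {
        "WebHome": "See SearchIsBroken and SearchSuggestion",
        "SearchIsBroken": "Related: SearchSuggestion",
        "SearchSuggestion": "No links here",
    }
    print(topic_rank(topics))
```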
So we have a road map for getting from the current state (SearchWithAnd) to more search-engine-like features.
I suggest that each be fully implemented before moving on to the next, and that each should have to be specifically turned on with a parameter, i.e. off by default, so that all the %SEARCH% tags already written don't have to be changed.
--
SamHasler - 12 Feb 2003
Sam, I agree that some of the search topics need to be rearranged. Following your proposal on SearchSuggestion, I was about to create SearchTopicAndText, as I wanted to separate the two problems (1. search topic and text, 2. KeywordSearchWithImplicitAnd) discussed in SearchSuggestion. But then I figured that in this special case the issues are connected and should be solved together.
My reasoning is: the title of a page (i.e. the TWiki topic name) is a component of the page. Thus if you do a simple search, the title should be included in the search as well.
Additionally, to help users, simple search should return as many pages as possible. To do this one has to make the search set as big as possible. Adding the topic names would increase the search set.
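A minimal Python sketch of what widening the search set to include topic names could look like. The function name and the `topics` mapping are assumptions for illustration only; TWiki's real simple search works differently.

```python
def search_topic_and_text(term, topics):
    """Return sorted names of topics whose name or text contains term.

    topics is a hypothetical mapping of topic name -> topic text.
    Matching is a case-insensitive substring test.
    """
    needle = term.lower()
    return sorted(
        name for name, text in topics.items()
        if needle in name.lower() or needle in text.lower()
    )


if __name__ == "__main__":
    topics = {
        "SearchIsBroken": "grep fails on large webs",
        "WebHome": "See the search page",
    }
    # A text-only search for "search" would find WebHome alone;
    # including topic names also finds SearchIsBroken.
    print(search_topic_and_text("search", topics))
```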
We also have to keep in mind that the average user does not know how the TWiki search function is implemented, and thus does not know that the internal data structure separates the topic title from its content. These distinctions exist for the programmer, but because they are hidden from users we have to overcome them.
If you agree, we should follow up on KeywordSearchWithImplicitAnd.
--
DanielKabs - 12 Feb 2003
My reasoning for wanting two topics is that it's easier to discuss and implement them separately. It also means that anyone looking for one or the other set of functionality would have a descriptive topic name and only the information they are interested in. There are already patches for both issues in separate topics. As long as both of those patches make it into the core, does it matter to you whether they are handled separately or together?
--
SamHasler - 12 Feb 2003
I've already argued why those problems are so closely related that they should be implemented together. You could even say it's only one issue because it's all about improving TWiki's search interface.
Therefore I recommended discussing both problems in one topic. In Star Trek parlance: the need to get both features implemented outweighs the need to reduce combining of issues.
If we have two separate topics, KeywordSearchWithImplicitAnd and SearchTopicAndText, how do we express their close relationship (if you agree that there is one): link one from the other, or create a parent topic that refers to both?
--
DanielKabs - 13 Feb 2003
You are right, Daniel. Both features need to go into the core. Both are just TWiki quirks. Average users not only do not know how search is implemented: they could not care less why TWiki decided to handle search differently. They compare search features in TWiki with their experience on other web sites and Google, and all differences will be counted as TWiki bugs.
RuleOne: If TWiki does the search as others do, it's intuitive and no further explanation is needed. It could be documented, and each reader will just say: "yeah, fine, no surprises here".
--
PeterMasiar - 13 Feb 2003
When I said implemented, perhaps I meant documented. We should discuss and document them separately, so that they are in topics with meaningful names and people can find patches to apply to old installations.
--
SamHasler - 13 Feb 2003
I'm not sure if this is the right topic (please move it if there is a better place), but I have made my own search that works rather well. Here is a description of how it works:
My search ANDs all the search terms together, and it uses the command line tool "bool" instead of grep. It uses proximity searching, so all of the search terms must occur within 60 words of each other. Rather than displaying results one web at a time, all of the results are displayed together, with the web name next to the topic name. Several factors are taken into consideration when sorting the results. Each page gets a sort of 'score' or 'ranking'. This number is a function of how many times the search terms appear in the topic, how many views the page has (taken from WebStatistics), and how many search terms are in the title of the topic. The number of page views has the least impact on the score, and the number of search terms in the title has the greatest impact. WebStatistics and WebPreferences are automatically given scores of zero, because they are not usually what you are looking for, but they come up in the results frequently. Each individual search result also includes the first 150-word block from that topic that had all the search terms in it, with the search terms in bold. This way you can see if that topic uses the words in the correct context.
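The proximity test and the scoring described above could be sketched in Python roughly as follows. This is not the actual code (which is a Perl sub in Search.pm built on bool). The weights are invented purely to reflect the stated ordering (title matches count most, page views least), and all function and constant names are hypothetical.

```python
import re

# Topics that always score zero, as described above.
EXCLUDED_TOPICS = {"WebStatistics", "WebPreferences"}

# Assumed weights: only their relative order comes from the description.
TITLE_WEIGHT = 100.0
BODY_WEIGHT = 1.0
VIEW_WEIGHT = 0.01


def within_proximity(text, terms, window=60):
    """Return True if all terms occur as whole words within `window` words."""
    words = re.findall(r"\w+", text.lower())
    wanted = {term.lower() for term in terms}
    if not wanted:
        return False
    # Slide a window of `window` words across the text.
    for start in range(max(1, len(words) - window + 1)):
        if wanted <= set(words[start:start + window]):
            return True
    return False


def score_topic(name, text, views, terms):
    """Combine title matches, body occurrences and page views into one score."""
    if name in EXCLUDED_TOPICS:
        return 0.0
    lowered_name = name.lower()
    lowered_text = text.lower()
    title_hits = sum(1 for term in terms if term.lower() in lowered_name)
    body_hits = sum(lowered_text.count(term.lower()) for term in terms)
    return (TITLE_WEIGHT * title_hits
            + BODY_WEIGHT * body_hits
            + VIEW_WEIGHT * views)


if __name__ == "__main__":
    terms = ["install", "guide"]
    print(score_topic("InstallGuide", "install steps", 10, terms))
    print(within_proximity("install the guide", terms))
```

Results would then be sorted by descending score after filtering with the proximity test.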
At the top of the page, above the search results, it shows the number of times each term was found. It also shows the search string that was actually used. If you type in a search string of several terms, and one of them isn't found anywhere at all in the TWiki, then the entire search would fail, because none of the other search terms can be proximate to a non-existent word. Because of this, terms in your search string that don't occur anywhere are automatically dropped. So if you type in a search for "PI#JE Installation EJLWKFUISAJ31JK Guide", it will actually use "Installation Guide". The user interface is a text box, "Enter Search Words", a "submit search" button, and a link to the regular search if people want it. It's very basic, and not intimidating to the timid user.
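The term-dropping step can be illustrated with a short Python sketch. Again, this is an assumption-based illustration rather than the real code: `drop_missing_terms` and the `topics` mapping are made-up names, and a term counts as "found" if it occurs as a case-insensitive substring of any topic's text.

```python
def drop_missing_terms(query, topics):
    """Return query with terms removed that occur in no topic text at all.

    topics is a hypothetical mapping of topic name -> topic text.
    """
    corpus = " ".join(topics.values()).lower()
    kept = [term for term in query.split() if term.lower() in corpus]
    return " ".join(kept)


if __name__ == "__main__":
    topics = {"InstallationGuide": "Installation Guide for the server"}
    print(drop_missing_terms("PI#JE Installation EJLWKFUISAJ31JK Guide", topics))
```

With the sample topic, the two nonsense terms are dropped and the query that remains is "Installation Guide", matching the example in the description.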
It has been surprisingly good at putting exactly what we're looking for right at the top of the list. However, there are a few drawbacks. In my implementation, it is an extra sub added to Search.pm, which some people don't like to do. It can be slow sometimes (but we have a rather large TWiki installation, so that might be the cause), and it's large (~1000 lines, but with lots of comments). It also requires the installation of bool.
I'll have to talk to my supervisor about it, but I don't think it will be any problem for me to post the code here in a day or two. My code might be a little ugly because I started out by copying the searchWeb function in Search.pm, and then I just modified it a bunch. So it's probably not how it would look if I had started this from scratch. searchWeb is still in Search.pm though, so the normal search still works. Anyway, this new search tool has worked great for us so far, just thought you people might be interested.
--
DavidSachitano - 02 Jul 2003
I'm interested in your search modifications, David. If you haven't discovered them already, you might find PhotonSearch and GoIsSearch interesting.
As for what might be the proper place for this discussion, BoolSearch seems appropriate.
--
MattWilkie - 17 Jul 2003
I'm currently playing around with the ability to put the parent of a found item in the formatted output - perhaps this would be useful in the next release?
--
PaulPetterson - 19 Jul 2003