TWiki's grep-based search is amazingly fast. At work we have over 800 topics in our knowledge base web and search is more than good enough. However, it can become an issue on a public site where hundreds of people access the site simultaneously. AFAIK, the TWiki installation at JOS disabled search for exactly that reason.
There are free search engines available that could be used to spider the TWiki webs.
Here are pluses and minuses compared to grep search:
- Plus: Fast.
- Plus: Flexible search with AND / OR combination.
- Minus: Time lag of indexing (e.g. topic updates don't show up immediately in search)
- Minus: Can't support all TWiki search options (regular expressions, for example, or KevinKinnell's latest additions)
I have not looked into installing a search engine, but as I understand it you create some template files for the search engine. When you call a search script from a form, it generates a search result based on the input parameters and formats it using the templates. Ideally it should be possible to do it the other way around in TWiki: create a replacement wikisearch.pm script that calls a search script, captures the result from stdout, and then formats it to TWiki's needs.
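A minimal sketch of that idea, assuming a hypothetical external command (the indexsearch path below is a placeholder, as is its output format of one matching file path per line):
<verbatim>
# Hypothetical drop-in for the grep call in wikisearch.pm: run an
# external search command, capture its stdout, and reformat the hits
# into TWiki bullet-list markup.
sub externalSearch
{
    my( $searchString ) = @_;

    my $cmd = "/usr/local/bin/indexsearch";      # placeholder command
    my $quoted = $searchString;
    $quoted =~ s/'/'\\''/g;                      # crude shell quoting

    my @hits = `$cmd '$quoted' 2>/dev/null`;

    my $result = "";
    foreach my $line ( @hits ) {
        chomp $line;
        # map ".../data/Web/Topic.txt" back to a web and topic name
        next unless $line =~ m|([^/]+)/([^/]+)\.txt$|;
        $result .= "   * [[$1.$2]]\n";
    }
    return $result;
}
</verbatim>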
Does anybody have experience with search engines?
--
PeterThoeny - 05 May 2000
Have you considered
http://search.cpan.org/search?dist=MMM-Text-Search
or something similar?
--
NicholasLee - 06 May 2000
Workin' on it in Pure Perl as a .pm and probably using one of the search modules on
CPAN. Won't be done for some time; gotta make a living too.
--
KevinKinnell - 07 May 2000
I have some experience with search engines. I've used ht://Dig on the LUF site and
SwishE at work. One problem I ran into with ht://Dig was that the database grew to a very large size since it stores copies of the pages in the database. Also, it finds words that are in the templates, not just in the .txt files.
I'm thinking of switching to SwishE since you can do a file system search of the .txt files and map them into the TWiki URLs. Since SwishE is a command-line driven program it should integrate directly with the existing TWiki search script. It also supports searching multiple indexes, which would allow searches of individual or multiple webs, as well as boolean operators, wildcards, and limiting searches to particular fields (e.g. META, TITLE, etc.)
With the HTTP indexing feature, I can include other parts of the site or even index other sites to be included in the search parameters.
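A rough sketch of what that command-line integration could look like, assuming a SwishE index of the data directory already exists; the paths, the -f/-w/-m options and the output format are based on the SwishE versions I have seen and may differ for yours:
<verbatim>
# Sketch: call SwishE from the existing search script and turn hits
# on .txt files into TWiki view URLs.  Assumes the classic output
# format of:  <rank> <path> "<title>" <size>
my $swish = "/usr/local/bin/swish-e";
my $index = "/home/twiki/index/twiki.index";
my $query = "boolean and wildcard*";

open( SWISH, "$swish -f $index -w '$query' -m 50 |" ) or die "swish-e: $!";
while( <SWISH> ) {
    next unless m|^(\d+)\s+\S*/([^/]+)/([^/]+)\.txt\s|;
    my( $rank, $web, $topic ) = ( $1, $2, $3 );
    print "$rank http://myserver/cgi-bin/view/$web/$topic\n";
}
close( SWISH );
</verbatim>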
--
JamalWills - 09 May 2000
You may also consider SWISH++ (
http://homepage.mac.com/pauljlucas/software/swish/
) for indexing and searching. The httpindex part can be fed from the command line - so there's no problem with only indexing .txt files or excluding some files. Think of the Unix "find" command for generating the file list.
I'm currently working on some combination of searching TWiki webs with SWISH++.
--
MichaelSpringmann - 27 Mar 2001
I'd be very interested in your conclusions.
--
MartinCleaver - 27 Mar 2001
A generic indexing key:text_to_index package would be best for the long run. Consider
PackageTWikiStore and the comments at the end of
NativeVersionFormat on Web.Topic indexes.
--
NicholasLee - 27 Mar 2001
See also
SearchAttachments
--
MartinCleaver - 29 Mar 2001
After looking at the product listings in
SearchAttachments, I found that a good starting point would be:
http://www.perlfect.com/freescripts/search/
(full Perl, as opposed to the other suggestions). However, I think that any search engine has to be TWiki-ized. That is:
- Indexing has to be lightweight, i.e. we cannot trigger full site indexing on every topic change, nor can we just wait for the next indexing event to happen.
- The web/subweb structure has to be maintained by the indexing mechanism
- The TWiki vocabulary of text mark-up and wiki words has to be understood by the search engine and the indexing mechanism (i.e. WikiWords themselves should be broken into their component words; see the sketch after this list)
- Meta information should be kept in the indexes to speed searches, e.g. revision/author information, initial set of lines...
- Stop words (non-indexing terms) should probably be dynamically generated by the indexing mechanism, and not pre-established (would reduce language and vocabulary dependencies)
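For the WikiWord point above, a minimal sketch of the kind of tokenizing the indexer could do (just the splitting idea, not a real indexer):
<verbatim>
# Sketch: break a WikiWord into its component words so that, e.g.,
# "SearchEngineVsGrep" is indexed as "Search", "Engine", "Vs", "Grep"
# as well as the full WikiWord itself.
sub indexTerms
{
    my( $word ) = @_;
    my @terms = ( $word );
    if( $word =~ /^[A-Z]+[a-z]+[A-Z]/ ) {            # looks like a WikiWord
        push @terms, ( $word =~ /([A-Z][a-z0-9]+)/g );
    }
    return @terms;
}

print join( " ", indexTerms( "SearchEngineVsGrep" ) ), "\n";
# -> SearchEngineVsGrep Search Engine Vs Grep
</verbatim>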
The first point is the main one to me, and I have given some thought to this, so let me open my thoughts up for comments:
- On every topic edit generate a temporary file that contains the current state of the topic (alternatively, this file could just contain the current indexing information for the topic)
- This could also be done by extracting the previous revision from RCS; however, not all topic revisions are saved, e.g. those that happen within the topic lock interval.
- On topic save, generate new indexing information, remove the old information from the indexes, and place the new information in.
- We would probably need both forward and backward indexes per web, to speed up the process.
- However: a global reverse word index for the site would simplify the actual search engine
- On some regular interval either re-index the site, or do some consistency check on the indexes, to ensure that no errors have crept into the databases.
Advantages: much faster than re-indexing the whole site, and it can run as a background task after the actual topic save. A rough sketch of this flow follows below.
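In this sketch, in-memory hashes stand in for whatever on-disk index format would actually be used; the structure is purely illustrative:
<verbatim>
# Keep a forward index (topic -> words) so the previous entries can be
# removed cheaply, and a reverse index (word -> topics) for the search.
my %forward;   # "Web.Topic" => [ words ]
my %reverse;   # word        => { "Web.Topic" => 1 }

sub reindexTopic
{
    my( $web, $topic, $newText ) = @_;
    my $key = "$web.$topic";

    # 1. remove the entries recorded for the previous revision
    foreach my $word ( @{ $forward{$key} || [] } ) {
        delete $reverse{$word}{$key};
    }

    # 2. tokenize the new revision and add it back in
    my @words = map { lc } ( $newText =~ /(\w+)/g );
    $forward{$key} = \@words;
    $reverse{$_}{$key} = 1 foreach @words;
}

# called after the topic file has been written, e.g. from the save script
reindexTopic( "Know", "SearchEngineVsGrep", "Search engines are fast" );
</verbatim>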
But there are a couple of stopping points that I can't figure out yet:
- What to do when two people try to edit the same topic simultaneously (this is actually allowed by TWiki).
Mmmm... I guess I figured this one out: in keeping with the peer trust system we just need to re-generate indexes on topic saves, so we just look at the current topic before the file is actually saved [ EB: 4/3/01 ]
- An index locking mechanism has to be put in place, which begs the question: what to do when we need to re-index a topic but it is locked?
Comments???
--
EdgarBrown - 30 Mar 2001
See
MultipleCommitPathBehaviour which will define minimum spec for the save-path part of the edit-preview-save cycle.
--
NicholasLee - 03 Apr 2001
We are looking at integrating Ht:dig with TWiki. The primary objective is to provide searching on attachments - this and attachment rev control mean that TWiki is an effective document repository (that actually gets used!). Rather than try and replace the native TWiki search, the idea is to fire both searches in parallel and then to integrate the search results pages.
Advanced search can include the (3) htdig controls and allow users to choose which tool to use.
The question is what to display as the default - is it best to exclude topics from the ht:dig collection, to show only ht:dig results and live with the fact that it indexes overnight, or to somehow include both?
Can anyone offer advice?
--
SteveRoe - 01 Jun 2001
The advantage of using something like
ht:dig
is that we're not adding the complexity of indexing etc. to TWiki. But what are the disadvantages, and do they matter:
- Users have to install both TWiki and a search engine
- Unless some integration is done, the search engine is likely to be out of date by up to say a day.
--
JohnTalintyre - 01 Jun 2001
JamalWills said last year (above) that
SwishE might be a better option. Has anyone got experience with that?
--
MartinCleaver - 01 Jun 2001
John / Steve / Anyone else - did you get htdig integrated with TWiki? On Windows or on UNIX?
I've got a member of staff who can investigate this this week, and I'm finding that having no ability to search attachments is a major impediment on a web with 400 topics but over 3GB of data! Many thanks.
--
MartinCleaver - 02 Oct 2001
At work we have htdig running alongside TWiki - we've done no integration. Users have a choice of the TWiki integrated search or the faster htdig search, which also searches some attachment types (including Word and PDF files).
--
JohnTalintyre - 03 Oct 2001
Aha. I see, thanks. Are you running on NT or UNIX? Have you created a [.Topic] in TWiki from which to invoke the htdig search?
--
MartinCleaver - 03 Oct 2001
Running unix, modified
WebSearch to include ht/dig.
--
JohnTalintyre - 03 Oct 2001
Great, thanks. I'm going to get someone to try it on Windows! (Fingers crossed!!)
- Are you ht/digging only the attachment space or do you do the topics as well?
- Once you have found something in an attachment, do you relate it back to the topic? If so, how?
--
MartinCleaver - 04 Oct 2001
Just a note -- I found a new (to me) search engine:
- Namazu 2.0
-- at a quick glance it appears to do everything I want to do except proximity searches. There are some confusing things in the documentation (something like "can't search on another computer" -- but I suspect this is just a translation problem, especially as the web site can be searched using namazu 2.0.5) -- the product is written by Japanese developers.
I'm not sure this will be useful -- the thing that intrigues me is that one of the search programs is namazu.cgi, which somehow makes me (as a non-expert in CGI, HTML, Perl, etc.) suspect that maybe this will be convenient to integrate into TWiki. This is at least partially a note to remind me to investigate further.
--
RandyKramer - 27 Nov 2001
Namazu looks like a very useful Intranet tool. I've managed to get the windows binary version half-way working on Windows2k. There are limitations which I'm sure a programmer could fix fairly easily:
- MSOffice must be installed on the local computer to index .doc & .xls files - Namazu uses Win32::OLE by default. I see code to use wvWare (as well? instead?) but I don't know how to get that working.
- MSO documents being indexed must be on the local computer - after processing ~100 remote documents I get a crash in winword.exe after which no further office documents are indexed. (There does not appear to be any problem indexing remote .txt or .html files.)
Other stuff might be harder to fix. For instance, indexing MSO documents is slow: 70 minutes to index 1800 local documents totalling 370 MB on a dual PIV 800MHz with 512 MB RAM. Maybe using wvWare instead of OLE would help here.
| | Total | Indexed |
| files | 1826 | 1451 |
| size | 373 MB | 72 MB |
Rough RAM use breakdown while indexing:
| perl.exe | 52 MB |
| winword | 15 MB |
| excel | 8 MB |
CGI installation was a snap:
- copy namazu.cgi.exe to /cgi-bin/ (there are no other namazu files on the webserver machine)
- edit /cgi-bin/.namazurc (based on $namazu/etc/namazurc-sample; only 2 changes were necessary, sketched below)
- add path to index files
- add a Replace line so the search results are clickable
- point browser to http://mywebserver/cgi-bin/namazu.cgi
and it "just works"
--
MattWilkie - 28 Nov 2001
Wonderful! (Please tell me you were working on this before you saw my note, for the sake of my sanity!)
I'll have to look into this, but it sounds simple enough. In my case I'd be indexing the (TWiki .txt) files on the webserver, so I would have the indexing programs there as well, probably run by a cron job. (My (private) web server is on Linux, and I want to put a public site on
SourceForge, so, for example, the ".exe" is not applicable, but I suspect the installation would be almost identical except for those kinds of things.)
Thanks for the feedback, and, if you did try this after seeing my note -- thanks for the effort!
PS: Does it look like the search feature could be incorporated on a TWiki template? (To keep a consistent look and feel)
--
RandyKramer - 28 Nov 2001
Keeping a consistent look and feel should be easy enough. Namazu already uses separate header and footer files which are just simple HTML fragments (kept in the same directory as the index), and the script which generates the index has a "--with-template" argument.
One other thing I forgot to mention is that the CGI is very fast; the search results come up faster than a regular TWiki page.
As for your sanity, sorry, I can't help there because I traded mine in years ago. : )
--
MattWilkie - 29 Nov 2001
Matt, thanks for the follow up!
--
RandyKramer - 30 Nov 2001
Has anybody looked at GNU bool (
http://www.gnu.org/software/bool/bool.html
or
ftp://ftp.gnu.org/pub/gnu/bool/
)?
It supports boolean expressions of the form:
(sanity and there) or because
(sorry near help) and gnu
No regexp support unfortunately. While I agree there are some usability issues with booleans, this seems like it might be a good option/plugin for use in the simple search.
I ran
/tools/bool-0.2/bin/bool.exe -l -i 'fragment and space' *.c *.h
in the bool src directory and came up with:
bool.c
context.c
bool also understands
HTML 4 and can deal with the character entities (or so they claim).
Now how do plugins affect the search pages? Maybe time for a
SearchPlugins topic?
--
JohnRouillard - 01 Dec 2001
Thanks for the pointer -
bool is really impressive; it just works out of the box with TWiki, and it's very easy to build (
./configure; make; make check; make install). Its interpretation of 'near' is good, too - it treats two newlines as a new paragraph and only considers words within a paragraph as 'near'.
I just tried this out by changing
TWiki.cfg to point
$egrepCmd and
$fgrepCmd to
bool. The only thing remaining is to change TWiki so that it knows it is using
bool and changes syntax such as
easy wiki into
easy NEAR wiki, and
"easy wiki" into a phrase-matching search (just apply bool's
-F flag).
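For anyone wanting to try the same thing, the TWiki.cfg change is just repointing the two grep variables; the bool path below is an example, use wherever your install put it:
<verbatim>
# in TWiki.cfg -- point both search commands at bool instead of grep
$egrepCmd = "/usr/local/bin/bool";
$fgrepCmd = "/usr/local/bin/bool";
</verbatim>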
I previously implemented AND searching in a simple script called
andgrep (see
CategorySearchForm), but
bool is much better for phrase searching.
andgrep is still useful for form searching based on Field1 = foo AND Field2 = bar, though, since
RegularExpressions are needed.
I'm adding this page to
InterfaceProject since the TWiki search feature is one of the key usability issues I hear about from TWiki users in my company. I'd like to aim for something like Google searching, in its use of proximity searching even if not in intelligence. Ranking search results based on best match is important as well.
Ideally we could support two search options, keeping an identical search syntax at the user level and for embedded searching:
-
bool or similar for out-of-the-box searching on small to medium size TWiki sites
- an OpenSource search engine for larger sites with more data and users
- SWISH++, at least, has incremental indexing (i.e. it can add just one page to the index), so it should be possible to index when saving a topic (or perhaps after a short delay if this is a slow operation).
I don't think plugins make much difference to searching, as long as the raw topic text has meaningful keywords -
bool type searching happens on the *.txt files not the
HTML output, of course.
--
RichardDonkin - 09 Feb 2002
I ended up using Perlfect Search, mostly because it was written in Perl, and that's what I do best. It did require a bit of hacking though because it is not a web spider, so it sees the source of all of the .txt files. I also needed it to index MS Word docs (it only comes with PDF support). Here is a list of what I changed in the sources:
- Added Word support to the indexer by using "catdoc", a Unix based conversion utility (http://www.ice.ru/~vitus/catdoc/).
- Hacked the indexer to ignore meta info.
- Had the indexer call TWiki functions to convert the topic to HTML before indexing. This allows it to use heading tags for weighting and also makes sure that includes and embedded searches get indexed as part of a given page.
- Toying with the idea of adding Excel support; this can be done with a utility called xls2csv, which comes as part of more recent catdoc distributions, so it would be easy to add.
- Had to post process file paths discovered by the search indexer so that the .txt extension would be dropped and the pages would be opened through the "view" script.
- Perlfect only supports a single path to the files, and my attachments are not stored with the data, so I needed to hack in support for that as well.
I was pretty impressed with the results of the searches. The hits were right on, clips from the pages were displayed in the results with the search terms highlighted (including Word and PDF docs), you can perform searches to exclude terms, and it was easy to use built-in Perlfect params to limit a search to a single web.
Granted, it doesn't do everything... like searching for
WebForm data. So I still use the old search, mostly for embedding topic lists in pages based on a form, but that is about it. In all it was a great step up from the grep search, and finding what I really need is a breeze. Best of all, it's written in Perl, so it should be easy to make further enhancements.
...After reading this page, I figure I ought to check out htdig though, sounds interesting.
BTW - If anyone is interested in my Perlfect search hacks I can upload the changes. Just to warn in advance, the changes aren't well documented and my emphasis was on getting it to work the way I wanted, not to make the code look pretty.
Cheers.
--
RobertHanson - 17 Jul 2002
I'd like to see the changes. Have you considered contributing them back to the Perlfect people?
--
JohnRouillard - 18 Jul 2002
I'll try to get the changes together later today. As for the Perlfect people, I don't know that they would be interested in it at this point; it was a fairly quick hack and needs a lot of cleaning. I also started to make their code object oriented just to make it a little easier to work with... so there is still lots of work/cleaning to be done.
--
RobertHanson - 18 Jul 2002
Well, I keep posting bad news in the form of bugs here, so I thought I'd try something different.
Here is a positive, useful addition to TWiki for those who want a bit more kick in the simple search.
I have uploaded a Perl wrapper for the bool search tool; see #BoolPrevDisc on this page for further info.
This does what RichardDonkin asked for. It turns search requests like twiki problem crash into twiki and problem and crash. It also makes "teamwork tool" twiki into "teamwork tool" and twiki, which implements a phrase search for the words teamwork tool right next to each other, rather than just returning any topic that has the words teamwork and tool somewhere in its text.
It also allows more complex boolean expressions like twiki and (fix near loss) not rouillard, which finds all pages with the word twiki and the word fix within 10 words of the word loss, that don't have my name on them 8-).
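This is not the uploaded wrapper itself, just a minimal sketch of the implicit-AND part of the rewriting it does (the -F phrase handling and the topic-name fallback are left out):
<verbatim>
#!/usr/bin/perl -w
# Sketch: bare words get joined with "and", quoted phrases are kept
# whole, and words that are already bool operators pass through.
use strict;

sub toBoolQuery
{
    my( $query ) = @_;
    my @parts = ( $query =~ /"[^"]*"|\S+/g );   # phrases and bare words
    my @out;
    foreach my $part ( @parts ) {
        if( @out && $part !~ /^(and|or|not|near)$/i
                 && $out[-1] !~ /^(and|or|not|near)$/i ) {
            push @out, "and";
        }
        push @out, $part;
    }
    return join( " ", @out );
}

print toBoolQuery( 'twiki problem crash' ), "\n";     # twiki and problem and crash
print toBoolQuery( '"teamwork tool" twiki' ), "\n";   # "teamwork tool" and twiki
</verbatim>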
Bool does have a problem with searching stdin. It tries to provide lines of context around the match, and
this doesn't work because it prints multiple file names (and partial file names up to 60 characters of context)
around the matching line. So my wrapper falls back on fgrep for searching topic names.
I've only had this running since this afternoon, but I have a few caveats: not rouillard does not work as expected, while something like e not rouillard does do the trick; I guess bool wants to be positive and match something. I have tried to do a good job of cleaning the search string, but as always with my stuff, use at your own risk, no warranty implied or provided, etc.
I installed it like so:
- install bool
- download bool from ftp://ftp.gnu.org/gnu/bool/
- build it for your platform (bool builds OOTB on Cygwin)
- install it
- download the wrapper called boolwrapper.
- make sure the shebang line points to your Perl installation.
- change the paths in the wrapper to point to your fgrep and bool install. My path to bool is almost certainly wrong unless you are running a Depot-Lite software installation.
- install the wrapper in some directory.
- in TWiki.cfg, change your $fgrepCmd to point to the wrapper.
I didn't make it take over the role of egrep because I really wanted to keep regular expressions
available for those that know how to use them.
On another topic, has anybody tried using Perl in slurp-the-entire-file mode for applying regexps? It might be useful for patterns that can extend over multiple lines.
--
JohnRouillard - 07 Aug 2002
This looks very useful - I was going to do something similar but work intervened. As you probably know, TWiki does now have
SearchWithAnd built in, but the syntax you've implemented here is much better since everyone knows it from Google.
It would be good to map the basic
"teamwork tools" twiki type syntax onto
SearchWithAnd, which should not be too hard, so that there is an out-of-the-box improved search syntax (perhaps controlled by a new parameter on %SEARCH%?).
bool is more flexible for complex boolean searching of course, and good in terms of performance for complex searches as it only reads each file once, whereas
SearchWithAnd must launch
grep several times and re-scan at least some files.
The other thing that would be great is relevance ranking - not sure if
bool can do this, but ranking topics higher if they have multiple hits for the search terms would be useful. Ultimately it would be good to use something like Google's relevance ranking (see
Google:Google+PageRank
), but that's probably patented and would require a batch index building process - so a simple ranking based on no. of search terms 'hit' would be useful.
--
RichardDonkin - 07 Aug 2002
Well, I don't know about ranking, but bool (and grep) will return a line count (-c option) that could be used if the TWiki core made use of it. I was also thinking about changing the core so that it parses the output of the grep and uses it as context. E.g. grepping for "user's password" returns (with output truncated to make the page readable):
AppendixFileSystem.txt:| =.htpasswd= | Basic Authentication (htaccess) users file with username and encrypted password pairs |
InstallPassword.txt:an encrypted password generated by a user with ResetPassword.
InstallPassword.txt:After submitting this form the user's password will be changed.
MainFeatures.txt: * *Managing users:* Web-based [[TWikiRegistration][user registration]] ...
TWikiInstallationGuide.txt: * *NOTE:* When a user registers, a new line with the ...
TWikiRegistrationPub.txt:To edit pages on this TWiki Collaborative Web, you must have a ...
TWikiUpgradeGuide.txt: * *[[TWiki06x01.TWikiUserAuthentication#ChangingPasswords][Change passwords]]* ...
Parse out the file names, use the count of lines as a ranking, and use the rest as lines of context in a formatted search. E.g. context=120 would show 120 characters of returned info for each file, in addition to (or maybe replacing) the summary, depending on the search.
An example output (ascii format) with sortby=score context=160 would look like:
InstallPassword(2) 14 Dec 2001 - 02:42 - NEW AndreaSterbini
Install an Encrypted Password This form can be used only by the MAINWEB .TWikiAdminGroup
users to install an encrypted password generated by a user with ResetPassword ...
>an encrypted password generated by a user with ResetPassword.
>after submitting this form the user's password will be changed.
AppendixFileSystem(1) 18 Jul 2002 - 07:08 - r1.10 PeterThoeny
TOC STARTINCLUDE #FileSystem # Appendix A: TWiki Filesystem Annotated directory
and file listings, for the 01- Dec-2001 TWiki production release. Who and What is This ...
> | =.htpasswd= | Basic Authentication (htaccess) users file with username and encrypted
password pairs |
So the entry for InstallPassword floats to the top since it had the most lines returned; the number of hits is indicated by the number in parens on the top line. You could even (for GNU-type greps) use the -C # option to print additional context lines around the matching line.
For bool, you would use -O # for the number of lines and -C # for the number of characters in the context lines.
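A quick sketch of the parsing/ranking step, assuming the raw grep/bool output shown above (Topic.txt: followed by the matching line) arrives on stdin; the score is just the per-file line count:
<verbatim>
# Turn "Topic.txt:matched line" output into a ranked list with the
# matched lines kept as context.  Score = number of matching lines.
my( %score, %context );
while( my $line = <STDIN> ) {
    chomp $line;
    next unless $line =~ /^([^:]+)\.txt:(.*)$/;
    my( $topic, $text ) = ( $1, $2 );
    $score{$topic}++;
    push @{ $context{$topic} }, $text;
}

foreach my $topic ( sort { $score{$b} <=> $score{$a} } keys %score ) {
    print "$topic($score{$topic})\n";
    print ">$_\n" foreach @{ $context{$topic} };
}
</verbatim>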
This would make it a bit more useful. Just removing the -l from the grep and bool command lines will allow us to do a "poor man's" context. Does this sound like a good addition for the core TWiki?
Also does anybody think I should update the boolwrapper to allow ; to mean and?
--
JohnRouillard
Just a note: The Glimpse search engine handles regular expressions, misspellings, and ranking of results. Their engine works pretty well through an "agrep" command line utility which would make TWiki integration relatively painless. You can find them here:
http://www.webglimpse.net
and here:
http://www.webglimpse.org
- Free for non-commercial use and also for those who will help develop and test.
--
TomKagan - 24 Sep 2002