Pure Perl Backend for Search
On SingleEntryPointForSystemCalls, KennethPorter wrote:
"I don't see why fgrep is needed at all, given that Perl is a superset of grep"
The answer was simple: performance.
Thanks to SamHasler for providing the link to Perl Power Tools, which has a Perl implementation of grep. I adapted the SearchDotPm module to use it instead of fgrep, and then ran the AthensMarks benchmarks to test the truth of the statement. See the results below:
Cairo
TWiki core code benchmarks (TWiki/WhatIsWikiWiki)
| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 1.795 | 100 |
| base/cairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 3.66 | 49.0437158469945 |
| develop/searchcairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 3.83 | 46.8668407310705 |
TWiki core code benchmarks (Main/WebHome)
| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 1.975 | 100 |
| base/cairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 5.7 | 34.6491228070175 |
| develop/searchcairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 4.895 | 40.3472931562819 |
TWiki core code benchmarks (Main/WebChanges)
| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 5.125 | 100 |
| base/cairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 6.9 | 74.2753623188406 |
| develop/searchcairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 6.905 | 74.2215785662563 |
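A note on reading the AthensMarks column: the figures in the Cairo tables are consistent with the score being the base/athens time divided by the release's time, multiplied by 100. That formula is inferred from the numbers, not stated anywhere in this topic, and the function name below is made up for illustration. A minimal Python sketch under that assumption:

```python
def athens_marks(athens_time, release_time):
    """Relative speed score (inferred formula): 100 means as fast as base/athens."""
    return 100.0 * athens_time / release_time


# Main/WebHome timings taken from the Cairo table.
print(round(athens_marks(1.975, 5.7), 2))    # base/cairo -> 34.65
print(round(athens_marks(1.975, 4.895), 2))  # develop/searchcairo -> 40.35
```

Under this reading, a higher score means a faster page, so develop/searchcairo beats base/cairo on Main/WebHome and is roughly level with it on Main/WebChanges.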
DEVELOP Branch
TWiki core code benchmarks (TWiki/WhatIsWikiWiki)
TWiki core code benchmarks (Main/WebHome)
TWiki core code benchmarks (Main/WebChanges)
Discussion
Attached to the topic are the Grep implementation and the modified Search module.
--
RafaelAlvarez - 19 Nov 2004
Rafael, could you please explain:
- what the release column means. You lost me. Heading cairo, release develop/search ???
I think the discussion in SecurityAlertExecuteCommandsWithSearch showed that the current grep implementation is somewhat handicapped, as it does not call grep directly but starts a shell which then starts grep. So a speedup is possible: for example, using the fix of KennethPorter should speed up the search and make it safer. I have not found a search in the topics you benchmarked, so I guess grep is called somewhat differently. Are there shell calls involved too?
--
FrankHartmann - 19 Nov 2004
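To illustrate the point about starting a program directly rather than through a shell, here is a small Python sketch. It is not the TWiki code or KennethPorter's fix; `run_direct` is a hypothetical helper, and the Python interpreter stands in for grep so the example runs without any external tool. The idea is that an argument list handed straight to the child process is never parsed by a shell, so metacharacters in a search string stay literal and no extra shell process is started.

```python
import subprocess
import sys


def run_direct(argv):
    """Run a program without a shell; each argv entry reaches it verbatim."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


# A search string containing shell metacharacters.
hostile = "foo; echo injected"

# sys.executable stands in for grep: the child just prints its first argument.
echo_arg = [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile]
print(run_direct(echo_arg).strip())  # foo; echo injected
```

Had the same string been interpolated into a single command line and given to a shell, the `;` would have ended the first command and started a second one.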
Ok, let me explain:
Under the Cairo header, base/cairo is the unmodified Cairo distribution, and develop/searchcairo is the patched cairo.
Under the DEVELOP header, checkout/DEVELOP is some version of the DEVELOP branch (don't have the number at hand) and develop/search is the patched version.
About the topics:
- WhatIsWikiWiki doesn't have a search. It's like the "control group" to test that everything is fine.
- Main includes SiteMap which in turn uses
- WebChanges includes WebChanges, which in turn calls the search script that calls the TWiki::UI::Search module that calls the TWiki::Search module, the one I patched.
So, the last two topics go through SearchDotPm.
I'll patch my installations with the fix in SingleEntryPointForSystemCalls and rerun the benchmarks.
--
RafaelAlvarez - 19 Nov 2004
New Benchmarks:
TWiki core code benchmarks (Main/WebHome)
TWiki core code benchmarks (Main/WebChanges)
- base/trunk: MAIN branch, revision 3238 patched with KennethPorter's patch
- base/DEVELOP: DEVELOP branch, revision 3238 (unpatched)
- develop/trunk: MAIN branch, revision 3238 modified to use the perl implementation of grep
- develop/DEVELOP: DEVELOP branch, revision 3238 modified to use the perl implementation of grep
Can someone verify these results with bigger webs, please?
--
RafaelAlvarez - 19 Nov 2004
It would be useful to see MAIN r3238 unpatched, to separate the effects of the exec patch from the DEVELOP versus MAIN performance.
But it looks like, for a small web, the lower efficiency of an in-process grep is more than offset by the elimination of the separate process.
A large web would tell us more about the comparative performance of the built-in grep versus the external program.
--
KennethPorter - 19 Nov 2004
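For readers unfamiliar with what an "in-process grep" involves, the following Python sketch shows the general shape: open each topic file, test each line against a compiled regex, and collect the names of files that match, much like `grep -l`. It is an illustration only, not the Perl implementation attached to this topic; the function name and the demo file names are invented.

```python
import os
import re
import tempfile


def files_matching(pattern, paths):
    """Return the paths whose contents match the regex, like `grep -l`."""
    regex = re.compile(pattern)  # compile once, reuse for every file
    hits = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            # any() stops reading a file at its first matching line
            if any(regex.search(line) for line in handle):
                hits.append(path)
    return hits


# Tiny demo on two throwaway "topics".
with tempfile.TemporaryDirectory() as web:
    topics = {"TopicA.txt": "uses %SEARCH% here\n", "TopicB.txt": "plain text\n"}
    paths = []
    for name, text in topics.items():
        path = os.path.join(web, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    matched = [os.path.basename(p) for p in files_matching(r"%SEARCH", paths)]
    print(matched)  # ['TopicA.txt']
```

No child process is created here, which is the saving discussed above; the cost is that the matching loop runs in the interpreter rather than in a compiled external program.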
Results for a regex search across 5.000 topics, three of which match the search.
TWiki core code benchmarks for BigWeb/BigSearch
--
CrawfordCurrie - 20 Nov 2004
Any idea why Athens is getting such good times?
--
KennethPorter - 20 Nov 2004
I have three (of course, I cheated: I'm looking at the code):
- SearchDotPm in DEVELOP is double the size of its counterpart in Athens
- In DEVELOP, there are at least 4 calls to handleCommonTags
- There is a lot more pattern matching and extracting in DEVELOP than in Athens
That's just a quick scan. I'm sure that a deeper review will show us more.
--
RafaelAlvarez - 20 Nov 2004
I presume that the phrase "regex search across 5.000 topics" is the normal cross-cultural confusion over whether "." or "," represents a thousands separator or a decimal point. I'm assuming that the test is significant, so there are five thousand topics.
I do wonder, though, if the size of the file and the number of "near misses" (i.e. the regex engine almost makes it but still does its cranking) are significant. I also wonder if the number of matches and the number of matches per topic are significant.
The issue, Kenneth, isn't why Athens gets good times; it's a smaller code base, does less work, and compiles faster. As you point out, the issue is 'what is the turnover point' where the number of files makes up for the cost of spawning a number of separate processes. I say "a number" because there is a limit on the size of the command line. Somewhere along the line more than one grep gets spawned.
Anyone for polyvariate analysis?
--
AntonAylward - 20 Nov 2004
Performance is critical mostly for large webs, where improvements are more noticeable than in smaller webs. Webs with 10K topics are not uncommon in a corporate environment. I recommend basing the benchmarks on a web with 10K topics and an average topic size of 4KB.
--
PeterThoeny - 21 Nov 2004
I realize that the shell has a command line limit. Does exec also have an argument length limit? (Recall that it takes an array of arguments.) If not, then skipping the shell removes that limit as an issue. (It's probably still an issue on Windows, where each program is responsible for parsing its own command line.)
--
KennethPorter - 21 Nov 2004
The current Search implementation accounts for the limit: it calls the external grep in batches of 512 topics.
--
PeterThoeny - 21 Nov 2004
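The batching described above can be sketched as follows. This is a Python illustration of the idea only, not the code in the Search module; the helper name and the generated topic names are hypothetical, and only the batch size of 512 comes from the comment above.

```python
def batches(items, size=512):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical web of 10,000 topic files.
topics = ["Topic%05d.txt" % n for n in range(10000)]
chunks = list(batches(topics))

# Each chunk would become the file list of one external grep call.
print(len(chunks), len(chunks[-1]))  # 20 272
```

So a 10,000 topic web costs 20 grep invocations with this batch size, which is the per-process overhead that an in-process search avoids.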
I believe the main limit on command length is the exec system call, on most systems.
--
RichardDonkin - 21 Nov 2004
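On POSIX systems the limit on the combined size of arguments and environment passed to exec is exposed as `ARG_MAX`. A quick way to check it on a given server is sketched below in Python; `arg_max` is a hypothetical helper name, and the value is platform dependent, so no particular number should be assumed.

```python
import os


def arg_max():
    """Return the system's ARG_MAX in bytes, or None if it is not available."""
    if hasattr(os, "sysconf") and "SC_ARG_MAX" in os.sysconf_names:
        return os.sysconf("SC_ARG_MAX")
    return None  # e.g. on Windows, where os.sysconf does not exist


limit = arg_max()
print("ARG_MAX:", limit)
```

Comparing this figure with the total length of the topic file names in one batch gives a rough idea of how many files a single grep call can take.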
The numbers I gave before were wrong. The AthensMarks are a bit zany because Athens doesn't actually work; the search returns no topics. I assume this is because of the arguments length issue. To keep everyone happy, I reran:
TWiki core code benchmarks for BigWeb/BigSearch (search of 10,000 topics, average size 3858 bytes, 10 topics match the search)
i.e. pure Perl search is 44% of the speed of grep search in this case.
Note that when there are a lot of search results, the performance is totally swamped by the time taken to format the search results for display. Something I guess Google discovered a long time ago.
--
CrawfordCurrie - 21 Nov 2004
WebStatistics shows this is a very popular topic (June 2005)...
--
MartinCleaver - 15 Jun 2005
To speed up search I think the best strategy is to install DBCachePlugin. I have observed that on complex searches there is a 50% speed improvement.
I am currently working on trying to rewrite the FormQueryPlugin interface to be as close to %SEARCH% as feasible....
--
ThomasWeigert - 15 Jun 2005
For those who don't mind compiling C code (as part of a Perl extension), see NativeSearch, which removes the dependency on grep and is quite efficient, though it may be a bit harder to install, depending on your platform.
--
RichardDonkin - 01 Jul 2007