Pure Perl Backend for Search

On SingleEntryPointForSystemCalls, KennethPorter wrote:

"I don't see why fgrep is needed at all, given that Perl is a superset of grep"

The answer was simple: performance.

Thanks to SamHasler for the link to Perl Power Toys, which includes a Perl implementation of grep, I adapted the SearchDotPm module to use it instead of fgrep and then ran the AthensMarks benchmarks to test whether the statement holds. See the results below:

Cairo

TWiki core code benchmarks (TWiki/WhatIsWikiWiki)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 1.795 | 100 |
| base/cairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 3.66 | 49.0437158469945 |
| develop/searchcairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 3.83 | 46.8668407310705 |

TWiki core code benchmarks (Main/WebHome)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 1.975 | 100 |
| base/cairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 5.7 | 34.6491228070175 |
| develop/searchcairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 4.895 | 40.3472931562819 |

TWiki core code benchmarks (Main/WebChanges)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 5.125 | 100 |
| base/cairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 6.9 | 74.2753623188406 |
| develop/searchcairo | classic | DefaultPlugin, SpreadSheetPlugin, CommentPlugin, EditTablePlugin, InterwikiPlugin, RenderListPlugin, SlideShowPlugin, SmiliesPlugin, TablePlugin | 6.905 | 74.2215785662563 |

DEVELOP Branch

TWiki core code benchmarks (TWiki/WhatIsWikiWiki)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 1.81 | 100 |
| checkout/DEVELOP | classic | DefaultPlugin, InterwikiPlugin | 2.52 | 71.8253968253968 |
| develop/search | classic | DefaultPlugin, InterwikiPlugin | 2.62 | 69.0839694656488 |

TWiki core code benchmarks (Main/WebHome)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 1.985 | 100 |
| checkout/DEVELOP | classic | DefaultPlugin, InterwikiPlugin | 4.08 | 48.6519607843137 |
| develop/search | classic | DefaultPlugin, InterwikiPlugin | 3.04 | 65.2960526315789 |

TWiki core code benchmarks (Main/WebChanges)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 5.07 | 100 |
| checkout/DEVELOP | classic | DefaultPlugin, InterwikiPlugin | 5.305 | 95.5702167766258 |
| develop/search | classic | DefaultPlugin, InterwikiPlugin | 5.21 | 97.3128598848369 |
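
The topic never spells out how the AthensMarks column is derived. In the tables above, the numbers are consistent with 100 times the Athens time per page divided by the release's time per page. The sketch below, in Python purely for illustration, assumes that formula; `athens_marks` is a hypothetical helper name, not part of the benchmark scripts.

```python
def athens_marks(athens_time, release_time):
    """Score a release against the Athens baseline (Athens itself scores 100).

    Assumed formula, inferred from the tables: 100 * athens_time / release_time.
    """
    if release_time <= 0:
        raise ValueError("release_time must be positive")
    return 100.0 * athens_time / release_time


# Times per page taken from the Cairo TWiki/WhatIsWikiWiki table.
print(athens_marks(1.795, 1.795))  # base/athens
print(athens_marks(1.795, 3.66))   # base/cairo
print(athens_marks(1.795, 3.83))   # develop/searchcairo
```

A score below 100 means the release is slower than Athens on that topic, and a score above 100 means it is faster.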

Discussion

Attached to the topic are the Grep implementation and the modified Search module.

-- RafaelAlvarez - 19 Nov 2004

Rafael, could you please explain:

  • what the release column means. You lost me. Heading cairo, release develop/search ???

I think the discussion in SecurityAlertExecuteCommandsWithSearch showed that the current grep implementation is somewhat handicapped, as it does not call grep directly but starts a shell, which then starts grep. So a speedup is possible: for example, using KennethPorter's fix should speed up the search and make it safer. I have not found a search in the topics you benchmarked, so I guess grep is called somewhat differently. Are there shell calls involved too?

-- FrankHartmann - 19 Nov 2004
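
The point about the extra shell can be illustrated with a minimal sketch. It is written in Python rather than TWiki's Perl, and it is not KennethPorter's actual fix; `build_grep_argv` and `run_direct` are hypothetical names. The idea is only that passing the command as a list of arguments, with no shell in between, saves one process and keeps the search string from being parsed by a shell.

```python
import subprocess
import sys


def build_grep_argv(pattern, files):
    """Return an argument list for a fixed-string, filenames-only grep."""
    # "--" ends option parsing, so a pattern starting with "-" is not read as a flag.
    return ["grep", "-F", "-l", "--", pattern, *files]


def run_direct(argv):
    """Run argv without a shell and return its standard output as text."""
    # A list with the default shell=False hands each element to the program
    # unchanged; no shell ever sees the search string.
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


# A search string full of shell metacharacters stays one literal argument.
# The child here is the Python interpreter itself, so the demo needs no grep.
echo_arg = [sys.executable, "-c", "import sys; print(sys.argv[1])", "foo; echo injected"]
print(run_direct(echo_arg))
```

Perl offers the same choice: the list forms of `system` and `exec` bypass the shell, while the single-string form may go through it.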

Ok, let me explain:

Under the Cairo header, base/cairo is the unmodified Cairo distribution, and develop/searchcairo is the patched cairo.

Under the DEVELOP header, checkout/DEVELOP is some version of the DEVELOP branch (I don't have the revision number at hand) and develop/search is the patched version.

About the topics:

  1. WhatIsWikiWiki doesn't contain a search. It acts as the "control group" to check that everything else is fine.
  2. Main includes SiteMap, which in turn uses a search.
  3. WebChanges includes WebChanges, which in turn calls the search script, which calls the TWiki::UI::Search module, which calls the TWiki::Search module, the one I patched.

So, the last two topics go through SearchDotPm.

I'll patch my installations with the fix in SingleEntryPointForSystemCalls and rerun the benchmarks.

-- RafaelAlvarez - 19 Nov 2004

New Benchmarks:

TWiki core code benchmarks (Main/WebHome)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 0.594444444444444 | 100 |
| base/trunk | classic | DefaultPlugin, InterwikiPlugin | 1.204 | 49.3724621631598 |
| base/DEVELOP | classic | DefaultPlugin, InterwikiPlugin, TestFixturePlugin | 1.324 | 44.8976166498825 |
| develop/trunk | classic | DefaultPlugin, InterwikiPlugin | 0.991666666666667 | 59.9439775910364 |
| develop/DEVELOP | classic | DefaultPlugin, InterwikiPlugin | 0.811428571428571 | 73.2589984350548 |

TWiki core code benchmarks (Main/WebChanges)

| Release | Skin | Plugins | Time per page | AthensMarks |
| base/athens | | DefaultPlugin, InterwikiPlugin | 1.77 | 100 |
| base/trunk | classic | DefaultPlugin, InterwikiPlugin | 1.792 | 98.7723214285714 |
| base/DEVELOP | classic | DefaultPlugin, InterwikiPlugin | 1.092 | 162.087912087912 |
| develop/trunk | classic | DefaultPlugin, InterwikiPlugin | 1.746 | 101.374570446735 |
| develop/DEVELOP | classic | DefaultPlugin, InterwikiPlugin | 1.024 | 172.8515625 |

  1. base/trunk: MAIN branch, revision 3238, patched with KennethPorter's patch
  2. base/DEVELOP: DEVELOP branch, revision 3238 (unpatched)
  3. develop/trunk: MAIN branch, revision 3238 modified to use the perl implementation of grep
  4. develop/DEVELOP: DEVELOP branch, revision 3238 modified to use the perl implementation of grep

Can someone verify these results with bigger webs, please?

-- RafaelAlvarez - 19 Nov 2004

It would be useful to see MAIN r3238 unpatched, to separate the effects of the exec patch from the DEVELOP versus MAIN performance.

But it looks like, for a small web, the lower efficiency of an in-process grep is more than offset by the elimination of the separate process.

A large web would tell us more about the comparative performance of the built-in grep versus the external program.

-- KennethPorter - 19 Nov 2004
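
For readers who want to picture the in-process side of this trade-off, here is a minimal sketch of a search that never leaves the interpreter. It is written in Python for illustration and is not the Perl grep attached to this topic; `search_in_process` and the topic file names are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def search_in_process(pattern, paths):
    """Return the paths whose contents match the regular expression."""
    regex = re.compile(pattern)  # compile once, reuse for every file
    matches = []
    for path in paths:
        # The whole file is read at once; patterns using ^ or $ as line
        # anchors would additionally need re.MULTILINE.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        if regex.search(text):
            matches.append(path)
    return matches


# Tiny demonstration on two throwaway topic files.
with tempfile.TemporaryDirectory() as tmp:
    hit = Path(tmp, "TopicA.txt")
    miss = Path(tmp, "TopicB.txt")
    hit.write_text("TWiki is a WikiWiki clone\n", encoding="utf-8")
    miss.write_text("nothing to see here\n", encoding="utf-8")
    found = search_in_process(r"Wiki\s*Wiki", [hit, miss])
    print([p.name for p in found])
```

No process is spawned here, but every file is opened and scanned by interpreted code, which is where the lower per-file efficiency comes from.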

Results for a regex search across 5.000 topics, three of which match the search.

TWiki core code benchmarks for BigWeb/BigSearch

| Release | Skin | Plugins | Time per page | AthensMarks |
| athens | | DefaultPlugin | 0.360714285714286 | 100 |
| DEVELOP (original search) | classic | DefaultPlugin | 0.758571428571429 | 47.6459510357815 |
| DEVELOP (pure perl search) | classic | DefaultPlugin | 0.6325 | 60.8087564609304 |

-- CrawfordCurrie - 20 Nov 2004

Any idea why Athens is getting such good times?

-- KennethPorter - 20 Nov 2004

I have three ideas (of course, I cheated: I'm looking at the code):

  1. SearchDotPm in DEVELOP is twice the size of its counterpart in Athens
  2. In DEVELOP, there are at least 4 calls to handleCommonTags
  3. There is a lot more pattern matching and extracting in DEVELOP than in Athens

That's just a quick scan. I'm sure that a deeper review will show us more.

-- RafaelAlvarez - 20 Nov 2004

I presume that the phrase "regex search across 5.000 topics" is the normal cross-cultural confusion over whether "." or "," represents a thousands separator or a decimal point. I'm assuming that the test is significant, so there are five thousand topics.

I do wonder, though, whether the size of the files and the number of "near misses" (i.e. the regex engine almost makes it but still does its cranking) are significant. I also wonder whether the number of matches and the number of matches per topic are significant.

The issue, Kenneth, isn't why Athens gets good times; it's a smaller code base, does less work, and compiles faster. As you point out, the issue is 'what is the turn over point' where the number of files makes up for the cost of spawning a number of separate processes. I say "a number" because there is a limit on the size of the command line. Somewhere along the line more than one grep gets spawned.

Anyone for polyvariate analysis?

-- AntonAylward - 20 Nov 2004
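
The "turn over point" question can be framed with a toy cost model. This is a sketch under loud assumptions: every cost constant below is hypothetical and chosen only to show a crossover, and `external_cost` and `in_process_cost` are illustrative names. The default batch size of 512 matches the figure PeterThoeny gives later in this topic. Python is used for illustration.

```python
import math


def external_cost(n_files, spawn_cost, per_file_cost, batch_size=512):
    """Estimated time when one grep process is spawned per batch of files."""
    processes = math.ceil(n_files / batch_size)
    return processes * spawn_cost + n_files * per_file_cost


def in_process_cost(n_files, per_file_cost):
    """Estimated time when every file is scanned inside the interpreter."""
    return n_files * per_file_cost


# Hypothetical costs in seconds, not measurements from this topic.
SPAWN, EXT_PER_FILE, INT_PER_FILE = 0.01, 0.0001, 0.0003

for n in (10, 100, 1000, 10000):
    ext = external_cost(n, SPAWN, EXT_PER_FILE)
    internal = in_process_cost(n, INT_PER_FILE)
    winner = "in-process" if internal < ext else "external"
    print(n, round(ext, 4), round(internal, 4), winner)
```

With these made-up constants the in-process search wins only for very small webs. Real values would have to be measured, and file size, near misses and match counts would each add a further variable.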

Performance is critical mostly for large webs, where improvements are more noticeable than in smaller webs. Webs with 10K topics are not uncommon in a corporate environment. I recommend basing the benchmarks on a web with 10K topics and an average topic size of 4KB.

-- PeterThoeny - 21 Nov 2004

I realize that the shell has a command line limit. Does exec also have an argument length limit? (Recall that it takes an array of arguments.) If not, then skipping the shell removes that limit as an issue. (It's probably still an issue on Windows, where each program is responsible for parsing its own command line.)

-- KennethPorter - 21 Nov 2004

The current Search implementation accounts for the limit: it calls the external grep in batches of 512 topics.

-- PeterThoeny - 21 Nov 2004
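
The batching described above can be sketched as follows. This is a Python illustration of the general technique, not the code in the Search module; `batches` and the topic file names are hypothetical, and only the batch size of 512 comes from this topic.

```python
def batches(items, size=512):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


topics = [f"Topic{i}.txt" for i in range(1100)]  # hypothetical topic file names
print([len(chunk) for chunk in batches(topics)])  # prints [512, 512, 76]
```

Each slice would become the file list of one external grep call, so a web of 1100 topics needs three processes rather than one command line that risks exceeding the limit.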

I believe that, on most systems, the main limit on command length comes from the exec system call.

-- RichardDonkin - 21 Nov 2004

The numbers I gave before were wrong. The AthensMarks are a bit zany because Athens doesn't actually work; the search returns no topics. I assume this is because of the argument length issue. To keep everyone happy, I reran:

TWiki core code benchmarks for BigWeb/BigSearch, (search of 10,000 topics, average size 3858 bytes, 10 topics match the search)

| Release | Skin | Plugins | Time per page | AthensMarks |
| DEVELOP (old search) | classic | DefaultPlugin | 1.838 | 27.9651795429815 |
| DEVELOP (perl search) | classic | DefaultPlugin | 4.176 | 12.4521072796935 |

i.e. pure perl search is 44% of the speed of grep search in this case.

Note that when there are a lot of search results, the performance is totally swamped by the time taken to format the search results for display. Something I guess Google discovered a long time ago.

-- CrawfordCurrie - 21 Nov 2004

WebStatistics shows this is a very popular topic (June 2005)...

-- MartinCleaver - 15 Jun 2005

To speed up search, I think the best strategy is to install DBCachePlugin. I have observed a 50% speed improvement on complex searches.

I am currently trying to rewrite the FormQueryPlugin interface to be as close to %SEARCH% as feasible.

-- ThomasWeigert - 15 Jun 2005

For those who don't mind compiling C code (as part of a Perl extension), see NativeSearch, which removes the dependency on grep and is quite efficient, though it may be a bit harder to install, depending on your platform.

-- RichardDonkin - 01 Jul 2007

Topic revision: r16 - 2007-07-01 - RichardDonkin
 