Question
I have a problem in a web where search no longer works. The web has too many topics in it for grep, which says:
bash: /bin/grep: Argument list too long
I tried to play around with find and xargs but did not succeed. Could a Unix guru help out?
- TWiki version: 01 Dec 2001
- Web server: Apache
- Server OS: Linux
- Web browser: N/A
- Client OS: N/A
--
PeterThoeny - 11 Dec 2001
Answer
The typical approach, which should work on any system with xargs, e.g. Linux and System V Release 4 variants such as Solaris, is:
find . -type f -name '*.txt' -print | xargs grep -l pattern
I just tried this out on Linux and it works OK. Any use of wildcarded arguments should really use xargs for scalability beyond the (approx) 10 Kbyte limit on command line arguments in Unix.
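The exact limit differs between systems, so the sketch below also shows how to read it. The function is a slightly hardened variant of the one-liner above, assuming a find and xargs that support -print0 and -0 (GNU and BSD both do); list_matching_topics is a hypothetical helper name, not part of TWiki.

```shell
#!/bin/sh
# Sketch: list topic files containing a pattern without ever placing the
# full file list on a single command line.
# list_matching_topics is a hypothetical name, not part of TWiki.
list_matching_topics() {
    pattern=$1
    # -print0 / -0 separate file names with NUL bytes, so whitespace in a
    # name cannot split it. xargs starts grep as many times as needed to
    # keep each invocation under the system's argument-size limit.
    # /dev/null is an extra operand that can never match; it ensures grep
    # always has at least one file operand.
    find . -type f -name '*.txt' -print0 |
        xargs -0 grep -l -- "$pattern" /dev/null
}

# The limit itself (in bytes) can be inspected with:
getconf ARG_MAX
```

Called as list_matching_topics pattern from inside a web's data directory, this prints one matching file per line, each still prefixed with "./".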
--
RichardDonkin - 12 Dec 2001
Thanks Richard, that works from the command line. A sample output is:
./TopicA.txt
./TopicB.txt
Now I need to make it work for TWiki. TWiki specifies the egrep or fgrep command in TWiki.cfg, e.g.
$egrepCmd = "/bin/egrep";
TWiki appends the parameters (switches, pattern and scope), e.g.
/bin/egrep -i 'pattern' *.txt
It looks like I need a wrapper (call it xegrep, say) that accepts the same parameters as egrep and does the find/xargs work. Note that the leading ./ should be stripped and that the number of parameters varies.
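A minimal sketch of such a wrapper follows. It is an illustration only: the name xegrep comes from the post, but the calling convention (switches and pattern only, no file list) is an assumption. If TWiki still appended *.txt, the shell would expand the glob before the wrapper started and fail with the same "Argument list too long" error.

```shell
#!/bin/sh
# xegrep: hypothetical stand-in for egrep. Call it as
#     xegrep [switches] pattern
# with no file list; the wrapper finds the *.txt files itself.
# Note that find also descends into subdirectories, unlike the *.txt glob.
xegrep() {
    # "$@" forwards however many switches the caller passed, plus the
    # pattern. /dev/null guarantees at least two file operands, so grep
    # always prefixes matching lines with the file name, as it would
    # for an expanded *.txt.
    find . -type f -name '*.txt' -print0 |
        xargs -0 grep -E "$@" /dev/null |
        sed 's|^\./||'      # strip the "./" that find prepends
}
```

For example, xegrep -l -i 'pattern' lists matching topics as TopicA.txt rather than ./TopicA.txt.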
--
PeterThoeny - 12 Dec 2001
The wrapper would work, but it would be easier to configure TWiki if you put the commands directly into a TWiki variable, so that $findCmd and $xargsCmd can be changed as appropriate; otherwise the user has to remember to edit the 'xegrep' script when installing TWiki, not just edit TWiki.cfg.
--
RichardDonkin - 13 Dec 2001
You are correct, better to do a clean solution. What I need for now is simply a quick hack on one installation.
Clean solution: Change $egrepCmd to include the placeholders %SWITCHES%, %PATTERN% and %FILTER%. Then the regular egrep can be replaced by find / xargs.
--
PeterThoeny - 12 Dec 2001
I ran into this problem on the GambasWiki and was able to fix it with a combination of the strategies described here.
In lib/TWiki/Search.pm, I changed the following lines starting at 212:
if( $theScope eq "topic" ) {
    $cmd = "$TWiki::lsCmd %FILES% | %GREP% %SWITCHES% -- $TWiki::cmdQuote%TOKEN%$TWiki::cmdQuote";
} else {
    $cmd = "%GREP% %SWITCHES% -l -- $TWiki::cmdQuote%TOKEN%$TWiki::cmdQuote %FILES%";
}
with the following:
if( $theScope eq "topic" ) {
    $cmd = "find . -type f -name '%FILES%' -print | perl -pe 's|^\.\/||;' | grep %SWITCHES% -l -- $TWiki::cmdQuote%TOKEN%$TWiki::cmdQuote";
} else {
    $cmd = "find . -type f -name '%FILES%' -print | perl -pe 's|^\.\/||;' | xargs -n 5 grep %SWITCHES% -l -- $TWiki::cmdQuote%TOKEN%$TWiki::cmdQuote";
}
It seems to be working well. (We include a "referenced by" search at the bottom of every page so our generated static version used in the help browser is automagically cross-referenced, so it was kinda important.) I'm surprised this doesn't come up more often and wonder if this fix should be standard issue. (Edit: I added the perl -pe clause up above because it fixed some formatting weirdness due to find's property of (correctly) prepending "./" to everything if you search ".".)
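For readers unfamiliar with the two helper steps in that pipeline, here is a small stand-alone illustration. It uses sed in place of the perl -pe clause (a substitution chosen for this sketch, not what the patch uses) and dummy arguments in place of file names.

```shell
#!/bin/sh
# Stripping the leading "./" that find adds (sed equivalent of the
# perl -pe clause):
printf './TopicA.txt\n./TopicB.txt\n' | sed 's|^\./||'
# prints:
#   TopicA.txt
#   TopicB.txt

# xargs -n 5 passes at most five arguments to each command it starts:
printf '%s\n' a b c d e f g | xargs -n 5 echo
# prints:
#   a b c d e
#   f g
```

With -n 5, a web of 10,000 topics would mean roughly 2,000 separate grep processes; leaving -n out lets xargs pack as many file names per invocation as the argument limit allows.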
--
RobKudla - 27 Jul 2003
I thank you for posting this and commend your ingenuity, Rob. I also urge the CoreTeam to exercise caution when we implement this. There is nothing wrong with this suggestion, but we really ought to isolate all architecture-dependent functionality, in particular where it materialises as external calls. I suggest we route them all through a new TWiki/Arch.pm (ArchDotPm) with subclasses TWiki::Arch::Unix, TWiki::Arch::Windows, etc.
--
MartinCleaver - 27 Jul 2003
You're right, of course; I was assuming when I wrote that fix that TWiki already depended on egrep and ls (for example) being there but I suppose that's why it's $TWiki::lsCmd and %GREP% in the command lines.
I actually wonder if in this case some of that couldn't be avoided by reimplementing what we need of ls and egrep in a platform-independent way. Maybe the way to do that would be to have e.g. TWiki::Arch::Fallback and have a native Perl implementation of external commands that differ by architecture, for those architectures without niceties like xargs.
--
RobKudla - 27 Jul 2003
Indeed, and you in turn are quite right. TWiki and the people using and supporting it would all benefit if we GetRidOfAllExternalLinkages.
IIRC, someone submitted a bunch of such patches on Codev quite recently, but I forget who, and it is too easy to lose such things on Codev (see PleaseCreateNewCategories).
I strongly suspect that the patches didn't make TWikiAlphaRelease.
--
MartinCleaver - 27 Jul 2003
Putting all external code into TWiki isn't an unalloyed benefit: it would simplify installation, but Perl-based grep is definitely slower, and search is already the slowest feature of TWiki for large installations. Patches to avoid the need to use xargs are very welcome, as are those to use it in a configurable way. I'd suggest a flag in TWiki.cfg called $largeWeb that is set to 1 to force the use of xargs (or of a slower technique in which a Perl-based loop processes the arguments and launches grep).
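The idea of such a switch can be sketched in shell as follows. This is only an illustration of the control flow: LARGE_WEB and search_topics are hypothetical names standing in for the proposed $largeWeb flag, and real TWiki code would live in Perl.

```shell
#!/bin/sh
# Sketch of a $largeWeb-style switch. LARGE_WEB and search_topics are
# hypothetical names, not existing TWiki settings or functions.
search_topics() {
    pattern=$1
    if [ "${LARGE_WEB:-0}" = 1 ]; then
        # Large web: never expand *.txt on the command line.
        # (find also descends into subdirectories, unlike the glob.)
        find . -type f -name '*.txt' -print0 |
            xargs -0 grep -l -- "$pattern" /dev/null |
            sed 's|^\./||'
    else
        # Small web: a single grep over the expanded glob is simplest.
        grep -l -- "$pattern" *.txt
    fi
}
```

Both branches print bare topic file names, so the caller does not need to know which one ran.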
As for the development issues, this is best talked about on Codev - everything I am working on (or not) is on TWiki.org, but right now I don't have a lot of time and I'm afraid the same is true of most other CoreTeam members. The best way for people to get onto the CoreTeam is to submit a few high quality patches that are suitable for the core and conform to the PatchGuidelines. So far I haven't seen many such patches...
--
RichardDonkin - 28 Jul 2003
This needs to be fixed, see Codev.ArgumentListIsTooLongForSearch
--
PeterThoeny - 10 Sep 2003
The ArgumentListIsTooLongForSearch issue was fixed on 01 Nov 2003, and the fix is available in the latest TWikiAlphaRelease or TWikiBetaRelease.
--
PeterThoeny - 26 Jan 2004
I implemented the above fix by modifying my Search.pm file, but now the ref-by search performed by the rename script doesn't work. Using the ref-by link seems to work OK, as does a search, but the rename script misses the pages.
Can anyone suggest why this might be? I will upgrade to the latest version, but I'll need to do that under change control, and right now I need to get it working.
--
AlexGarner - 17 Mar 2005
The "ref-by" function does not work because of a few small bugs. 1) Instead of hardwiring grep, the command should use %GREP%. This allows egrep to be invoked when needed. 2) Delete the -l option from the first $cmd line. This allows "topic" level searches to work again. 3) Technically, there is a missing backslash in the perl command, although this bug does not cause any harm.
The corrected patch that will fix the broken "ref-by" functionality is:
if( $theScope eq "topic" ) {
    $cmd = "find . -type f -name '%FILES%' -print | perl -pe 's|^\\.\/||;' | %GREP% %SWITCHES% -- $TWiki::cmdQuote%TOKEN%$TWiki::cmdQuote";
} else {
    $cmd = "find . -type f -name '%FILES%' -print | perl -pe 's|^\\.\/||;' | xargs -n 5 %GREP% %SWITCHES% -l -- $TWiki::cmdQuote%TOKEN%$TWiki::cmdQuote";
}
This patch (like the original one above) does not work for RegularExpression searches that include a semicolon (;), which indicates a boolean AND search. It is also slow because of the xargs command. I attach an alternate fix (context diff given by recursive_grep.txt) that solves both problems. This fix works only if you have a version of grep which supports the -r recursive option and the --include=PATTERN filter option. Used together, they eliminate the need for xargs. In ad hoc testing on systems with 10,000 to 30,000 documents, the recursive grep is about 6-8 times faster than grep using xargs. However, because of other processing overhead, the actual page rendering is only about twice as fast as with the xargs version. Still, it is definitely noticeable. With this fix, you might be able to squeeze a few more productive months out of the 01 Feb 2003 version.
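The recursive approach can be illustrated from the command line as follows. This is a sketch of the grep options named above, not the contents of recursive_grep.txt; it assumes a grep that supports -r and --include (GNU grep does), and recursive_search is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch, assuming a grep with -r and --include (e.g. GNU grep).
# recursive_search is a hypothetical helper, not the attached patch.
recursive_search() {
    pattern=$1
    # grep walks the directory tree itself and filters by file name, so
    # no file list is built on the command line and no xargs is needed.
    grep -r -l --include='*.txt' -- "$pattern" . |
        sed 's|^\./||'      # drop the "./" prefix from each result
}
```

Because a single grep process handles every file, the per-process startup cost of the xargs -n 5 variant disappears, which may account for part of the speed-up reported above.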
--
BrianPark - 25 Jun 2005
Category: TWikiPatches