Abstract
A proxy is a piece of middleware that sits between the browser (or another program pretending to be a browser) and the Internet. When a web page is requested, the request goes to the proxy, which downloads the relevant data, optionally preprocesses it, and then returns it to the browser as expected.
The Web Scraping Proxy is a way to automatically generate the CPAN:LWP Perl code necessary to emulate a real browser within scripts.
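For illustration, the generated script might look roughly like this (a hypothetical sketch, not actual translate.pl output; it assumes the CPAN:LWP modules are installed, and the target URL is made up):

```perl
#!/usr/bin/perl
# Hypothetical sketch of the kind of script translate.pl emits;
# the real generated output may differ.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new(
    agent   => 'Mozilla/5.0',   # pose as a real browser
    timeout => 5,
);

# Each request the browser made through the proxy becomes one of these:
my $req = HTTP::Request->new(GET => 'http://localhost/twiki/bin/view/Main/WebHome');
$req->header(Referer => 'http://localhost/twiki/bin/view/');

my $res = $ua->request($req);
print $res->is_success ? $res->content : 'Error: ' . $res->status_line, "\n";
```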
Applications
Other Application Ideas
- could be used to generate more detailed logs and statistics analysis
- could the existing logging be implemented via an (internal) proxy server, and if so, would that be a GoodThing?
- a debugging tool
- simulate real user scenarios in a load test situation
- need to add timestamps to translate.pl and add sleep statements to simulate user inactivity
- it may be possible to use this as part of a benchmarking suite, although I haven't done so myself; the TWiki benchmarking suites probably wouldn't benefit from this (mentioned for completeness)
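The load-test idea above could be sketched like this (hypothetical: it assumes translate.pl were extended to record an epoch timestamp per request, and the URLs and gaps shown are invented):

```perl
#!/usr/bin/perl
# Hypothetical sketch: replay the recorded gaps between requests to
# simulate user think time in a load test.  The events and their
# timestamps (t, in seconds) are invented for illustration.
use strict;
use warnings;

my @events = (
    { url => '/twiki/bin/view/Main/WebHome', t => 0 },
    { url => '/twiki/bin/edit/Main/WebHome', t => 1 },
    { url => '/twiki/bin/save/Main/WebHome', t => 2 },
);

my $prev;
for my $e (@events) {
    sleep($e->{t} - $prev->{t}) if $prev;   # pause for the recorded gap
    # ...issue the LWP request for $e->{url} here...
    print "fetch $e->{url}\n";
    $prev = $e;
}
```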
Setup
Server
CPAN Requirements
Download and Install
Download https://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt and wsp version 2:
mkdir wsp ; cd wsp
wget -O - http://www.research.att.com/~hpk/wsp/wspv2.tgz | tar xz
# (edit wsp.pl to have proper path to perl binary)
wget -O - http://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt >translate.pl
chmod +x translate.pl
Start Server
./wsp.pl -v | ./translate.pl >drive-lwp.pl
To see the events printed to the terminal as they happen, pipe the output through
tee:
./wsp.pl -v | ./translate.pl | tee drive-lwp.pl
Browser Configuration
- Your browser needs to be configured to use wsp.pl as a proxy. Methods vary from browser to browser, but in most cases you just set the HTTP proxy host to localhost and the port to 5634
- If you're running the WebScrapingProxy server locally, be sure to remove localhost and 127.0.0.1 from the "No Proxy for" entries
- Browse to a page in your wiki (making sure to use port 5634, eg http://localhost:5634/twiki/bin/view/ )
- Watch the output (if you used tee), or inspect drive-lwp.pl (eg, tail -f drive-lwp.pl)
Development
Outstanding Issues
- I've not actually used any code output from the WebScrapingProxy yet; in particular, I'm not sure what issues will come up regarding user authentication. I'm working on this background documentation first.
- as TWiki doesn't have a pure REST interface, CGI parameters will probably also need to be logged, and new code-generation logic will need to be added to translate.pl
- this page looks too "DocumentMode"-y; don't hesitate to refactor or add...
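The CGI-parameter issue might be handled by parsing the parameters back out of the logged URLs; here is a sketch using the CPAN URI module (the URL and parameter names are invented for illustration):

```perl
#!/usr/bin/perl
# Hypothetical sketch: recover CGI parameters from a logged URL so that
# translate.pl could regenerate them.  The URL and parameters are invented.
use strict;
use warnings;
use URI;

my $uri   = URI->new('http://localhost/twiki/bin/edit/Main/WebHome?t=1096300800&skin=print');
my %param = $uri->query_form;   # ( t => '1096300800', skin => 'print' )

# translate.pl could then emit something like:
#   my $res = $ua->post($url, [ %param ]);
print "$_ => $param{$_}\n" for sort keys %param;
```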
Brainstorming/Wishlist
There are many possible improvements and applications for the WebScrapingProxy:
- an improvement would be to pick up any Set-Cookie headers and reuse them for the remainder of the session
- could a proxy be used to (transparently) cache external INCLUDEs?
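The Set-Cookie idea maps directly onto CPAN's HTTP::Cookies; a minimal sketch (the TWIKISID cookie name and example.com domain are illustrative only):

```perl
#!/usr/bin/perl
# Sketch of the cookie improvement: an in-memory jar picks up any
# Set-Cookie headers and replays them for the rest of the session.
# The cookie name and domain below are illustrative only.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;   # add file => '...' to persist across runs
my $ua  = LWP::UserAgent->new;
$ua->cookie_jar($jar);          # from now on $ua stores/replays cookies itself

# Simulate a Set-Cookie previously captured during the session:
$jar->set_cookie(0, 'TWIKISID', 'abc123', '/twiki', '.example.com');

# The jar attaches matching cookies to later requests:
my $req = HTTP::Request->new(GET => 'http://www.example.com/twiki/bin/view/');
$jar->add_cookie_header($req);
print $req->header('Cookie'), "\n";
```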
wsp.pl improvements
wsp.pl could do with a few improvements; patches (diff -u) could be attached to this topic
- the shebang line (#!) doesn't use the "standard" /usr/bin/perl path
- the -i option ignores .jpg, .gif, and .css; could/should add others (especially .png, but perhaps other media types, too)
- the script should conditionally use the SSL modules and simply disable (or fail gracefully) if the proxy is used over an SSL connection
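The conditional-SSL idea could be sketched like this (hypothetical: handle_connect and the choice of IO::Socket::SSL are assumptions for illustration, not wsp.pl's actual structure):

```perl
#!/usr/bin/perl
# Sketch of the conditional-SSL idea: load the SSL support at run time
# and fail gracefully instead of dying at compile time.  handle_connect
# and the choice of IO::Socket::SSL are assumptions for illustration.
use strict;
use warnings;

my $have_ssl = eval { require IO::Socket::SSL; 1 } ? 1 : 0;

sub handle_connect {
    my ($host, $port) = @_;
    unless ($have_ssl) {
        warn "SSL modules unavailable; refusing CONNECT to $host:$port\n";
        return 0;   # disable rather than crash the whole proxy
    }
    # ...proceed to establish the SSL tunnel here...
    return 1;
}

print handle_connect('example.com', 443) ? "SSL ok\n" : "SSL disabled\n";
```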
Resources
--
WillNorris - 27 Sep 2004
Anonymous proxy network specifically for web scraping.
www.ScrapeGoat.com
--
AaronWillis - 25 Jun 2005