Abstract

A proxy is a piece of middleware that sits between the browser (or another program pretending to be a browser) and the Internet connection. When a web page is requested, the request goes to the proxy, which downloads the relevant data, optionally preprocesses it, then returns it to the browser as expected.

The Web Scraping Proxy is a way to automatically generate the CPAN:LWP Perl code necessary to emulate a real browser within scripts.

Applications

TWikiTestingInfrastructure

Other Application Ideas

  • could be used to generate more detailed logs and statistics analysis
    • /me wonders if it is possible to implement the existing logs using an (internal) proxy server and if so, would it be a GoodThing?
  • a debugging tool
  • simulate real user scenarios in a load test situation
    • need to add timestamps to translate.pl and add sleep statements to simulate user inactivity
  • it may be possible to use this as part of a benchmarking suite, although I haven't done so myself; the TWiki benchmarking suites probably wouldn't benefit from it (mentioned for completeness)
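
The load-test idea above depends on knowing how long the user paused between requests. The following is only a sketch of that step, assuming translate.pl were extended to record one timestamp per request (per the note above, it does not yet). The event list, the URLs, and the `think_times` helper are all hypothetical and do not reflect the actual wsp.pl log format.

```perl
use strict;
use warnings;

# Hypothetical recorded events: [epoch seconds, URL]. Neither the format nor
# the URLs come from wsp.pl; they only illustrate the idea.
my @events = (
    [ 1096323300, 'http://localhost/twiki/bin/view/Main/WebHome' ],
    [ 1096323304, 'http://localhost/twiki/bin/view/Codev/WebHome' ],
    [ 1096323319, 'http://localhost/twiki/bin/view/Codev/WebScrapingProxy' ],
);

# Return the pause (in seconds) to insert before each request.
sub think_times {
    my @stamps = @_;
    my @gaps   = (0);    # no pause before the first request
    for my $i (1 .. $#stamps) {
        my $gap = $stamps[$i] - $stamps[ $i - 1 ];
        push @gaps, $gap > 0 ? $gap : 0;    # clamp out-of-order stamps to zero
    }
    return @gaps;
}

my @pauses = think_times(map { $_->[0] } @events);

# Emit the lines a generated script could contain; the real request code
# would take the place of the comment line.
for my $i (0 .. $#events) {
    print "sleep $pauses[$i];\n" if $pauses[$i];
    print "# request $events[$i][1]\n";
}
```

Computing the gaps separately from emitting the `sleep` statements would make it easy to scale or cap the pauses later, for example to compress a long recorded session into a shorter load test.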

Setup

Server

CPAN Requirements

Download and Install

Download https://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt and wsp version 2:
mkdir wsp ; cd wsp
wget -O - http://www.research.att.com/~hpk/wsp/wspv2.tgz | tar xzf -
# (edit wsp.pl to have proper path to perl binary)
wget -O - http://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt >translate.pl
chmod +x translate.pl

Start Server

./wsp.pl -v | ./translate.pl >drive-lwp.pl

To see the events printed to the terminal as they happen, run the output through tee:

./wsp.pl -v | ./translate.pl | tee drive-lwp.pl

Browser Configuration

[Figure: browser proxy settings dialog box (web-scraping-proxy-settings.png)]
  1. Your browser needs to be configured to use wsp.pl as a proxy. Methods vary from browser to browser, but in most cases you just set HTTP Proxy to localhost and Port to 5634
    • Warning: if you're running the WebScrapingProxy server locally, be sure to remove localhost and 127.0.0.1 from the browser's "No Proxy for" list
  2. Browse to a page in your wiki (making sure to use port 5634, eg http://localhost:5634/twiki/bin/view/)
  3. Watch the output (if you used tee), or inspect drive-lwp.pl (eg, tail -f drive-lwp.pl)
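
The same proxy setting can be applied to a script instead of a browser. This is a minimal sketch, assuming wsp.pl is listening on localhost port 5634 as in step 1 and that the CPAN LWP modules are installed. The agent string is made up for illustration, and the URL to fetch is taken from the command line rather than assumed.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Route plain-HTTP requests through the local proxy (host and port from step 1).
my $ua = LWP::UserAgent->new(timeout => 10);
$ua->agent('Mozilla/5.0 (compatible; wsp-smoke-test)');    # illustrative value
$ua->proxy('http', 'http://localhost:5634/');

# Only send a request when a URL is supplied, so the script is harmless
# to run while the proxy is down.
if (my $url = shift @ARGV) {
    my $res = $ua->get($url);
    print $res->status_line, "\n";
}
```

Saved under a name of your choosing and run with a wiki page URL as its only argument, the script should cause that request to show up in the proxy output in the same way a browser request would.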

Development

Outstanding Issues

  • I've not actually used any code output from the WebScrapingProxy yet. In particular, I'm not sure what issues will come up regarding user authentication. I'm working on this background documentation first.
  • as TWiki doesn't have a pure REST interface, CGI parameters will probably also need to be logged, and translate.pl will need new code-generation logic to handle them
  • this page looks too "DocumentMode"-y; don't hesitate to refactor or add...

Brainstorming/Wishlist

There are many possible improvements and applications for the WebScrapingProxy:

  • an improvement would be to pick up any Set-Cookie headers and reuse them for the remainder of the session

  • could a proxy be used to (transparently) cache external INCLUDEs?
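
For the Set-Cookie item above, one option on the generated-script side is LWP's built-in cookie jar, which stores cookies from responses and replays them on later requests. The sketch below exercises the jar offline with a hand-built response, so nothing is sent over the network. The host name, the paths, and the `SESSIONID` cookie are hypothetical, and this is not something translate.pl currently emits.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;

# cookie_jar => {} gives the agent an in-memory HTTP::Cookies jar; with it,
# real requests made through $ua pick up and resend cookies automatically.
my $ua  = LWP::UserAgent->new(cookie_jar => {});
my $jar = $ua->cookie_jar;

# Simulate a first response that sets a session cookie (made-up values).
my $first = HTTP::Request->new(
    GET => 'http://wiki.example.com/twiki/bin/view/Main/WebHome');
my $reply = HTTP::Response->new(200, 'OK',
    [ 'Set-Cookie' => 'SESSIONID=abc123; path=/' ]);
$reply->request($first);          # the jar needs to know which URL set it
$jar->extract_cookies($reply);

# A later request to the same host gets the cookie attached.
my $next = HTTP::Request->new(
    GET => 'http://wiki.example.com/twiki/bin/view/Main/WebStatistics');
$jar->add_cookie_header($next);
print 'Cookie: ', $next->header('Cookie'), "\n";
```

In a generated script the only change needed would be the `cookie_jar => {}` argument; the manual `extract_cookies` and `add_cookie_header` calls are here just to make the behaviour visible without a server.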

wsp.pl improvements

wsp.pl could do with a few improvements; patches (diff -u) could be attached to this topic
  • shebang line (#!) doesn't have the "standard" /usr/bin/perl path
  • -i option ignores .jpg, .gif, and .css; could/should add others (especially .png, but perhaps other media types, too)
  • script should conditionally use the SSL modules and simply disable (or fail) if trying to use the proxy with an SSL connection
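
The last two items could be approached roughly as follows. This is a sketch only and not taken from wsp.pl: the `is_ignored` and `check_scheme` helpers are hypothetical, and `IO::Socket::SSL` is an assumed stand-in for whichever SSL module wsp.pl actually loads.

```perl
use strict;
use warnings;

# Extensions to skip: the three that -i already ignores, plus .png.
my $IGNORE_RE = qr/\.(?:jpg|gif|css|png)$/i;

# True if the URL's path ends in an ignored extension.
sub is_ignored {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*\z//s;    # drop query string and fragment
    return $path =~ $IGNORE_RE ? 1 : 0;
}

# Load the SSL module only if it is installed, instead of a hard "use".
# IO::Socket::SSL is an assumption; substitute the module wsp.pl really needs.
my $have_ssl = eval { require IO::Socket::SSL; 1 } ? 1 : 0;

# Plain HTTP is always allowed; https only when SSL support was loaded.
sub check_scheme {
    my ($url) = @_;
    return 1 unless $url =~ m{^https:}i;
    return $have_ssl;
}

print "SSL support: ", ($have_ssl ? "available" : "disabled"), "\n";
print "logo.png ignored: ", is_ignored('http://localhost/twiki/pub/logo.png'), "\n";
```

Keeping the extension list in a single regular expression means further media types can be added in one place, and the `eval { require ... }` form lets the script start without the SSL module and refuse only the requests that need it.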

Resources

-- WillNorris - 27 Sep 2004

Anonymous proxy network specifically for web scraping. www.ScrapeGoat.com

-- AaronWillis - 25 Jun 2005

Topic attachments
  • translate.pl.txt (r1, 1.2 K, 2004-09-27 22:15, UnknownUser): converts wsp logs into LWP Perl scripts
  • web-scraping-proxy-settings.png (r1, 28.4 K, 2004-09-27 22:14, UnknownUser): browser proxy settings dialog box
Topic revision: r3 - 2005-06-25 - AaronWillis
 
Copyright © 1999-2026 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.