Abstract
A proxy is a piece of middleware that sits between the browser (or another program pretending to be a browser) and the Internet. When a web page is requested, the request goes to the proxy, which downloads the relevant data, optionally preprocesses it, and then returns it to the browser as expected.
The Web Scraping Proxy is a way to automatically generate the CPAN:LWP Perl code necessary to emulate a real browser within scripts.
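For illustration, the generated script might look roughly like this (a hypothetical sketch, not actual translate.pl output; it assumes the CPAN:LWP modules are installed, and the target URL is made up):

```perl
#!/usr/bin/perl
# Hypothetical sketch of the kind of script translate.pl emits;
# the real generated output may differ.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new(
    agent   => 'Mozilla/5.0',   # pose as a real browser
    timeout => 5,
);

# Each request the browser made through the proxy becomes one of these:
my $req = HTTP::Request->new(GET => 'http://localhost/twiki/bin/view/Main/WebHome');
$req->header(Referer => 'http://localhost/twiki/bin/view/');

my $res = $ua->request($req);
print $res->is_success ? $res->content : 'Error: ' . $res->status_line, "\n";
```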
Applications
Other Application Ideas
- could be used to generate more detailed logs and statistics analysis
- could the existing logging be implemented via an (internal) proxy server, and if so, would that be a GoodThing?
- a debugging tool
- simulate real user scenarios in a load test situation
- need to add timestamps to translate.pl and add sleep statements to simulate user inactivity
- it may be possible to use this as part of a benchmarking suite, although I haven't done so myself; the TWiki benchmarking suites probably wouldn't benefit from this (mentioned for completeness)
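The load-test idea above could be sketched like this (hypothetical: it assumes translate.pl were extended to record an epoch timestamp per request, and the URLs and gaps shown are invented):

```perl
#!/usr/bin/perl
# Hypothetical sketch: replay the recorded gaps between requests to
# simulate user think time in a load test.  The events and their
# timestamps (t, in seconds) are invented for illustration.
use strict;
use warnings;

my @events = (
    { url => '/twiki/bin/view/Main/WebHome', t => 0 },
    { url => '/twiki/bin/edit/Main/WebHome', t => 1 },
    { url => '/twiki/bin/save/Main/WebHome', t => 2 },
);

my $prev;
for my $e (@events) {
    sleep($e->{t} - $prev->{t}) if $prev;   # pause for the recorded gap
    # ...issue the LWP request for $e->{url} here...
    print "fetch $e->{url}\n";
    $prev = $e;
}
```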
Setup
Server
CPAN Requirements
Download and Install
Download https://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt and wsp version 2:
mkdir wsp ; cd wsp
wget -O - http://www.research.att.com/~hpk/wsp/wspv2.tgz | tar xz
# (edit wsp.pl to have proper path to perl binary)
wget -O - http://twiki.org/p/pub/Codev/WebScrapingProxy/translate.pl.txt >translate.pl
chmod +x translate.pl
Start Server
./wsp.pl -v | ./translate.pl >drive-lwp.pl
To see the events printed to the terminal as they happen, pipe the output through
tee:
./wsp.pl -v | ./translate.pl | tee drive-lwp.pl
Browser Configuration
- Your browser needs to be configured to use wsp.pl as a proxy. Methods vary from browser to browser, but in most cases you just set the HTTP proxy host to localhost and the port to 5634
- If you're running the WebScrapingProxy server locally, be sure to remove localhost and 127.0.0.1 from the "No Proxy for" entries
- Browse to a page in your wiki (making sure to use port 5634, eg http://localhost:5634/twiki/bin/view/ )
- Watch the output (if you used tee), or inspect drive-lwp.pl (eg, tail -f drive-lwp.pl)
Development
Outstanding Issues
- I've not actually used any code output from the WebScrapingProxy yet; in particular, I'm not sure what issues will come up regarding user authentication. I'm working on this background documentation first.
- as TWiki doesn't have a pure REST interface, CGI parameters will probably also need to be logged, and new code-generation logic will need to be added to translate.pl
- this page looks too "DocumentMode"-y; don't hesitate to refactor or add...
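The CGI-parameter issue might be handled by parsing the parameters back out of the logged URLs; here is a sketch using the CPAN URI module (the URL and parameter names are invented for illustration):

```perl
#!/usr/bin/perl
# Hypothetical sketch: recover CGI parameters from a logged URL so that
# translate.pl could regenerate them.  The URL and parameters are invented.
use strict;
use warnings;
use URI;

my $uri   = URI->new('http://localhost/twiki/bin/edit/Main/WebHome?t=1096300800&skin=print');
my %param = $uri->query_form;   # ( t => '1096300800', skin => 'print' )

# translate.pl could then emit something like:
#   my $res = $ua->post($url, [ %param ]);
print "$_ => $param{$_}\n" for sort keys %param;
```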
Brainstorming/Wishlist
There are many possible improvements and applications for the WebScrapingProxy:
- an improvement would be to pick up any Set-Cookie headers and reuse them for the remainder of the session
- could a proxy be used to (transparently) cache external INCLUDEs?
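The Set-Cookie idea maps directly onto CPAN's HTTP::Cookies; a minimal sketch (the TWIKISID cookie name and example.com domain are illustrative only):

```perl
#!/usr/bin/perl
# Sketch of the cookie improvement: an in-memory jar picks up any
# Set-Cookie headers and replays them for the rest of the session.
# The cookie name and domain below are illustrative only.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;   # add file => '...' to persist across runs
my $ua  = LWP::UserAgent->new;
$ua->cookie_jar($jar);          # from now on $ua stores/replays cookies itself

# Simulate a Set-Cookie previously captured during the session:
$jar->set_cookie(0, 'TWIKISID', 'abc123', '/twiki', '.example.com');

# The jar attaches matching cookies to later requests:
my $req = HTTP::Request->new(GET => 'http://www.example.com/twiki/bin/view/');
$jar->add_cookie_header($req);
print $req->header('Cookie'), "\n";
```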
wsp.pl improvements
wsp.pl could do with a few improvements; patches (diff -u) could be attached to this topic
- the shebang line (#!) doesn't use the "standard" /usr/bin/perl path
- the -i option ignores .jpg, .gif, and .css; could/should add others (especially .png, but perhaps other media types, too)
- the script should conditionally use the SSL modules and simply disable (or fail gracefully) if the proxy is used over an SSL connection
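The conditional-SSL idea could be sketched like this (hypothetical: handle_connect and the choice of IO::Socket::SSL are assumptions for illustration, not wsp.pl's actual structure):

```perl
#!/usr/bin/perl
# Sketch of the conditional-SSL idea: load the SSL support at run time
# and fail gracefully instead of dying at compile time.  handle_connect
# and the choice of IO::Socket::SSL are assumptions for illustration.
use strict;
use warnings;

my $have_ssl = eval { require IO::Socket::SSL; 1 } ? 1 : 0;

sub handle_connect {
    my ($host, $port) = @_;
    unless ($have_ssl) {
        warn "SSL modules unavailable; refusing CONNECT to $host:$port\n";
        return 0;   # disable rather than crash the whole proxy
    }
    # ...proceed to establish the SSL tunnel here...
    return 1;
}

print handle_connect('example.com', 443) ? "SSL ok\n" : "SSL disabled\n";
```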
Resources
--
WillNorris - 27 Sep 2004
Anonymous proxy network specifically for web scraping.
www.ScrapeGoat.com
--
AaronWillis - 25 Jun 2005