Include URL with %INCLUDE{...}% Variable
Introduction
We already know and use the %INCLUDE{...}% variable to include topics. This enhancement allows you to embed an external web page inside a TWiki topic. The code below is tested and working. With this we have something like a poor man's SOAP client sans SOAP (see TWikiAsWebServicesClient).
Syntax
%INCLUDE{"http://host/path/to/page.html" pattern="reg-exp"}%
The nameless parameter is the URL. A parameter not starting with "http:" is handled the usual way, i.e. a topic is included; a parameter starting with "http:" causes an external web page to be included. Supported are fully qualified URLs with http protocol, domain name and optional port number, e.g. http://somewhere:80/index.html.
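The dispatch rule above can be sketched in a few lines. This is a Python illustration only, not the TWiki (Perl) implementation; the function name classify_include_target and the topic name SomeTopic are hypothetical.

```python
from urllib.parse import urlsplit

def classify_include_target(param: str):
    """Return ("url", host, port) for an http: URL, else ("topic", param, None)."""
    if not param.startswith("http:"):
        # Anything else is treated the usual way: as a topic name.
        return ("topic", param, None)
    parts = urlsplit(param)
    # The port is optional; assume the HTTP default of 80 when it is absent.
    return ("url", parts.hostname, parts.port or 80)

print(classify_include_target("http://somewhere:80/index.html"))
print(classify_include_target("SomeTopic"))  # hypothetical topic name
```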
Supported content types are text/html and text/plain. The full page is included by default, but the HTML header is stripped when the content is a web page.
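As a rough illustration of the content-type check and header stripping, here is a Python sketch. It assumes the content type may carry a suffix such as "; charset=ISO-8859-1" and therefore matches by prefix; the function name strip_html_wrapper is hypothetical, and the real TWiki code is Perl.

```python
import re

def strip_html_wrapper(text: str, content_type: str) -> str:
    """Keep only the body of an HTML page; pass plain text through unchanged."""
    flags = re.I | re.S  # case-insensitive, '.' also matches newlines
    if content_type.startswith("text/html"):
        text = re.sub(r"^.*?</head>", "", text, count=1, flags=flags)        # drop the HEAD
        text = re.sub(r"^.*?<body[^>]*>", "", text, count=1, flags=flags)    # drop up to <BODY>
        text = re.sub(r"</body>.*", "", text, count=1, flags=flags)          # drop </BODY> to end
        text = re.sub(r"</html>.*", "", text, count=1, flags=flags)          # drop </HTML> to end
        return text
    if content_type.startswith("text/plain"):
        return text
    raise ValueError(f"Unsupported content type: {content_type}")

page = "<html><head><title>t</title></head><body bgcolor='x'>Hello</body></html>"
print(strip_html_wrapper(page, "text/html; charset=ISO-8859-1"))
```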
The pattern parameter is optional and lets you extract part of a web page. Specify a RegularExpression that scans from start ('^') to end and contains the text you want to keep in parentheses, e.g. pattern="^.*?(text to keep).*".
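To see how such a pattern behaves, the following Python sketch applies it the same way as the Perl substitution s/$thePattern/$1/is quoted further down this topic: the whole match is replaced by the first parenthesized group. The function name apply_pattern and the sample page are made up for illustration.

```python
import re

def apply_pattern(text: str, pattern: str) -> str:
    """Replace the matched text with capture group 1 (the part to keep)."""
    # re.I and re.S correspond to the /i and /s modifiers of the Perl code.
    return re.sub(pattern, r"\1", text, count=1, flags=re.I | re.S)

page = "<p>intro</p>\n<b>text to keep</b>\n<p>outro</p>"
print(apply_pattern(page, r"^.*?(text to keep).*"))
```

If the pattern does not match, the text is returned unchanged, which is why the pattern should scan from the start to the end of the page.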
Usage Examples
1. Display regression test results in a TWiki page
<pre>
%INCLUDE{"http://domain/~user/REDTest.log.txt"}%
</pre>
2. Display AltaVista's robots.txt file
- You type:
-
%INCLUDE{"http://www.altavista.com/robots.txt"}%
- You get:
%INCLUDE{"http://www.altavista.com/robots.txt"}%
3. Display the SUNW stock quote in a TWiki page
- You type:
-
SUNW: %INCLUDE{"http://finance.yahoo.com/q?s=SUNW&d=v1&o=t" pattern="^.*?>SUNW</a>[^<]+(.*?)\s+\S+\s+<small.*"}%
- You get:
4. Temperature in San Jose
- You type:
-
San Jose: %INCLUDE{"http://weather.yahoo.com/forecast/San_Jose_CA_US_f.html" pattern="^.*?([0-9]+\º\;F).*"}%
- You get:
To Do
Questions
The pattern parameter is powerful, but requires regular expression knowledge. Is the syntax too complicated? Would it be better to offer a non-regexp search?
- Yes, the syntax is too complicated (and essentially write-only). Aside note: Somebody should really host a WWW service that "disassembles" regexps for viewing by a non-expert, and TWiki should offer a link there on the edit pages (or any other page where regexp could be entered or displayed). But it is well-established in a lot of TWiki communities, so it should remain available.
It might be a good idea to offer search patterns in the style of Alta Vista, or to offer search facilities similar to those of Bugzilla.
[ JoachimDurchholz - 25 Oct 2001 ]
Should the pattern syntax be made more powerful, i.e. split up into a search pattern parameter and a replace pattern parameter to allow reformatting of the included page? Example:
searchpattern="^.*?abc\s+(\S+)\s+(\S+).*$" replacepattern="Price: $1, change: $2"
- Yes, I think it should adhere to the usual conventions. Anybody who knows and uses regexps should be able to take advantage of them to their full extent.
Whether using it in the way shown above is a good idea depends on how frequent and how massive the changes in the text being searched are. IOW adopting XML might already do everything for us that search-and-replace patterns can do (but I agree that XML would be a major change, and I've got absolutely no idea how to make the users separate text and formatting).
[ JoachimDurchholz - 25 Oct 2001 ]
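The proposed searchpattern/replacepattern split can be tried out with a small Python sketch. The parameters are only a proposal in this topic, so nothing here reflects an existing TWiki feature; the function name search_replace and the sample numbers are invented for illustration.

```python
import re

def search_replace(text: str, search_pattern: str, replace_pattern: str) -> str:
    """Apply a search pattern and reformat the match using $1, $2, ... references."""
    # Translate Perl-style $1, $2 into Python's \g<1>, \g<2> group references.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replace_pattern)
    return re.sub(search_pattern, template, text, count=1, flags=re.I | re.S)

sample = "header\nabc 12.50 +0.25\nfooter"  # invented sample data
print(search_replace(sample,
                     r"^.*?abc\s+(\S+)\s+(\S+).*$",
                     "Price: $1, change: $2"))
```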
Security Considerations
- Are there any security risks with the user-definable pattern match? I escaped the dangerous characters $@%&#'`/. Is that enough?
if( $thePattern ) {
    $thePattern =~ s/([^\\])([\$\@\%\&\#\'\`\/])/$1\\$2/go;
    $thePattern =~ /(.*)/;     # untaint
    $thePattern = $1;
    $text =~ s/$thePattern/$1/is;
}
If anybody knows, I am interested to learn if there is any security risk.
- Perl style note: The explicit untainting in line 3 is unnecessary, as
$thePattern will already be untainted in line 2.
[ JoachimDurchholz - 25 Oct 2001 ]
- I don't understand what line 4 does - I think it will overwrite the freshly untainted $thePattern - is this correct?
[ JoachimDurchholz - 25 Oct 2001 ]
- A non-Perl question: I don't see any security risks at work here, other than the possible inadvertent inclusion of malicious scripts. It should be possible to exclude scripting depending on URL prefix - we don't want to include script code from a site that might have been compromised by a hacker, do we?
[ JoachimDurchholz - 25 Oct 2001 ]
- Why do you think that
$@%&#'`/ are dangerous characters?
[ JoachimDurchholz - 25 Oct 2001 ]
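For readers who want to experiment with the escaping step, here is a Python sketch of the same idea: put a backslash in front of each of the characters $@%&#'`/ unless one is already there. It is not a translation of the Perl line above. It uses a negative lookbehind, so unlike a pattern that requires a preceding non-backslash character, it also covers a character at the very start of the pattern and characters that directly follow one another. The function name escape_dangerous is hypothetical.

```python
import re

# The characters the snippet above treats as dangerous.
DANGEROUS = r"[$@%&#'`/]"

def escape_dangerous(pattern: str) -> str:
    """Backslash-escape dangerous characters that are not already escaped."""
    # (?<!\\) skips characters that are already preceded by a backslash.
    return re.sub(r"(?<!\\)(" + DANGEROUS + ")", r"\\\1", pattern)

print(escape_dangerous("price: $5 @ 10%"))
```

Whether escaping these characters is sufficient is exactly the open question of this section; the sketch only shows the mechanics.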
Feedback
Please provide feedback here, or in the text above as [ name ]-signed bullets.
--
PeterThoeny - 21 Jun 2001
It's a great idea - I started coding the same a while back but stopped because I had to get on with my job.
Comments:
- I would need it to be able to call out via a proxy. The LWP library has very comprehensive http-get functionality; perhaps we could use that (but see CpanPerlModulesRequirement).
- Can things like this not exist as "internal plugins" rather than being "in the core"? I mean by this that they would be distributed with TWiki and are normally enabled yet use the plugin architecture for reasons of modularity. Please forgive me if I don't know what I am talking about as I have not looked at versions beyond the December 2000 release!
- See AreVariablesReallyDirectives
--
MartinCleaver - 22 Jun 2001
I think this could be useful, but a very good point from Martin about proxies.
I don't see why %INCLUDE% shouldn't be extended to have this functionality, just checking for http: at the start of the parameter. If this goes in a plugin then it would be good to show that a variable (good point from Martin under AreVariablesReallyDirectives) can be enhanced by a plugin. If a site uses an authenticating proxy/Intranet, it might be necessary to disable this feature.
On proxies - we could add a new variable to TWiki.cfg. I think for most proxies it's just a case of pushing =GET http://...= to the proxy rather than the actual site. No doubt it's more complicated for authenticating proxies.
--
JohnTalintyre - 22 Jun 2001
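The difference between talking to the site directly and talking to a plain (non-authenticating) proxy comes down to the request line, as this Python sketch shows. The function name request_line and the choice of HTTP/1.0 are assumptions made for illustration; nothing here is taken from the TWiki code.

```python
from urllib.parse import urlsplit

def request_line(url: str, via_proxy: bool) -> str:
    """Build the first line of a GET request, for a proxy or for the origin server."""
    if via_proxy:
        # A proxy is sent the full URL so it knows which site to contact.
        target = url
    else:
        # The origin server only needs the path and the optional query string.
        parts = urlsplit(url)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
    return f"GET {target} HTTP/1.0"  # HTTP/1.0 is an assumption of this sketch

print(request_line("http://host/path/to/page.html", via_proxy=False))
print(request_line("http://host/path/to/page.html", via_proxy=True))
```

In the proxy case the connection itself is opened to the proxy host, which would be the new TWiki.cfg variable suggested above.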
Good point, let's use the existing %INCLUDE% variable for simplicity. For now let's take it into the core without proxy server support.
The first version is done and committed to TWikiAlphaRelease. TWiki.org is updated as well. The pattern parameter is now also available for normal topic includes.
--
PeterThoeny - 23 Jun 2001
Isn't there a fair chance that including extracted HTML will break the page, e.g. by including a table start but not its end? I guess it could be surrounded by an iframe to help avoid this; would this mean that the header should be kept? Or perhaps iframe inclusion could be an argument to the variable.
--
JohnTalintyre - 23 Jun 2001
Yes, there is a chance of broken HTML content when you use the pattern parameter. Since this is a power feature casual users won't use, it is up to the power user to make sure that the regular expression does the right thing.
--
PeterThoeny - 23 Jun 2001
On the grounds of
eat your own dog food, I fed our internal wiki to itself. It complained...
Error: Unsupported content type: text/html; charset=ISO-8859-1. (Must be text/html or text/plain)
I suggest a small modification to avoid this kind of syntax error:
if( $contentType =~ /text\/html/ ) {
    $text =~ s/^.*?<\/head>//ois;            # remove all HEAD
    $text =~ s/<script.*?<\/script>//gois;   # remove all SCRIPTs
    $text =~ s/^.*?<body[^>]*>//ois;         # remove all to <BODY>
    $text =~ s/<\/body>.*//ois;              # remove </BODY> to end
    $text =~ s/<\/html>.*//ois;              # remove </HTML> to end
} elsif( $contentType =~ /text\/plain/ ) {
    # do nothing
I am still getting many sites that refuse to render - and would prefer to be able to turn off the content type checking in TWiki.cfg. For now, I have patched Net.pm with:
#if( $contentType eq "text/html" ) {
if( $text =~ //i ) {
In terms of improvements to this, I would like to suggest a way of confining included HTML to frames. That's not to say that the current include function is not useful, but to recognise that there is also a benefit in being able to serve up a particular page whole from another site from within TWiki.
--
SteveRoe - 28 Jun 2001
Having played around with IE, I realize that this can be achieved by just embedding an
HTML IFRAME in a topic:
<IFRAME NAME=content width=800 height=1200 SRC=$item></IFRAME>
I am now trying to wrap this into a TWiki variable in the PeerRatingSystem I am implementing as a plugin. The notion is that you can review either an internal wiki page, or any page on the web as a recommendation to your colleagues. I would like to be able to include the page to be reviewed for the convenience of reviewers.
[If it goes into a plugin, I will make external review optional because IFRAMES are not supported by NS]
--
SteveRoe - 29 Jun 2001
Made some fixes for better
HTML page rendering:
- Allow any type of text after content type value:
if( $contentType =~ /^text\/html/ ) {
- Fix incomplete URLs in
href="", src="", action="".
- Join lines of HTML tags that span multiple lines (to prevent TWiki rendering from escaping the angle brackets)
--
PeterThoeny - 01 Jul 2001
Desired enhancement
It would be very nice if we could use the
InterwikiPluginEarlyDev syntax inside INCLUDE.
- [ PeterThoeny - 24 Jul 2001 ] This could and should be done in the plugin itself.
Security Issue
Is there a way to avoid infinite recursion? What if I write
-
%INCLUDE{"https://www.twiki.org/cgi-bin/view/Codev/IncludeUrlVariable"}%
- [ PeterThoeny - 24 Jul 2001 ] This currently times out. Protecting against this is an appropriate thing to do.
--
AndreaSterbini - 24 Jul 2001
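One possible guard against the simplest case, a topic that includes its own view URL, is to compare the requested URL with the URL of the topic being rendered before fetching anything. The Python sketch below assumes the current topic's URL is known at that point; the function name same_page is hypothetical, and the check does not catch longer cycles (topic A includes B, which includes A).

```python
from urllib.parse import urlsplit

def same_page(include_url: str, current_topic_url: str) -> bool:
    """True if both URLs point at the same host and path (scheme and query ignored)."""
    a, b = urlsplit(include_url), urlsplit(current_topic_url)
    # urlsplit lowercases the hostname; a trailing slash is ignored here.
    return (a.hostname, a.path.rstrip("/")) == (b.hostname, b.path.rstrip("/"))

current = "http://www.twiki.org/cgi-bin/view/Codev/IncludeUrlVariable"
print(same_page("https://www.twiki.org/cgi-bin/view/Codev/IncludeUrlVariable", current))
```

An include for which same_page returns True could be replaced by a short warning instead of being fetched.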
The include URL handling fixes incomplete links found in the included page. Until now it was only aware of href=/path and href="/path". It now also supports single-quoted references like href='/path'. This is in TWikiAlphaRelease.
--
PeterThoeny - 12 Sep 2003
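The link fixing described above can be approximated as follows. This Python sketch resolves href, src and action values against the URL of the included page and handles unquoted, double-quoted and single-quoted values; it is an illustration, not the TWiki implementation, and the name absolutize is hypothetical. For simplicity it always writes the result with double quotes.

```python
import re
from urllib.parse import urljoin

# Matches href=, src= or action= with a double-quoted, single-quoted or bare value.
ATTR_RE = re.compile(
    r"""\b(href|src|action)=(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.I,
)

def absolutize(html: str, base_url: str) -> str:
    """Rewrite relative href/src/action values as absolute URLs."""
    def fix(match):
        # Exactly one of the three value groups takes part in the match.
        value = next(v for v in match.group(2, 3, 4) if v is not None)
        return f'{match.group(1)}="{urljoin(base_url, value)}"'
    return ATTR_RE.sub(fix, html)

base = "http://host/dir/page.html"
print(absolutize("<a href=/path>x</a> <img src='img/a.png'>", base))
```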