Include URL with %INCLUDE{...}% Variable

Introduction

We already know and use the %INCLUDE{...}% variable to include topics. This enhancement allows you to embed an external web page inside a TWiki topic. The code below has been tested and works. With this we have something like a poor man's SOAP client sans SOAP (see TWikiAsWebServicesClient).

Syntax

%INCLUDE{"http://host/path/to/page.html" pattern="reg-exp"}%

The nameless parameter is the URL. A parameter that does not start with "http:" is handled the usual way, i.e. a topic is included. A parameter that starts with "http:" causes an external web page to be included. Supported are fully qualified URLs with the http protocol, a domain name and an optional port number, e.g. http://somewhere:80/index.html.
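
The dispatch rule described here could be sketched roughly as follows. This is an illustration only, not the code committed to TWiki; the function name classify_include_param and the layout of the returned hash are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether an INCLUDE parameter names a topic
# or an external URL, following the "starts with http:" rule.
sub classify_include_param {
    my ($param) = @_;

    # Anything not starting with "http:" is treated as a topic name.
    return { type => 'topic', topic => $param }
        unless $param =~ /^http:/;

    # Fully qualified URL: protocol, host, optional port, optional path.
    if ( $param =~ m{^http://([^/:]+)(?::(\d+))?(/.*)?$} ) {
        return {
            type => 'url',
            host => $1,
            port => defined $2 ? $2 : 80,
            path => defined $3 ? $3 : '/',
        };
    }

    # Starts with "http:" but is not a usable URL.
    return { type => 'error', message => "Malformed URL: $param" };
}
```

A caller would include the topic for type 'topic', fetch the page for type 'url', and show the message otherwise.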

Supported content types are text/html and text/plain.

The full page is included by default, but the HTML header is stripped in case it is a web page.

The pattern parameter is optional and allows you to extract some parts of a web page. Specify a RegularExpression that scans from start ('^') to end and contains the text you want to keep in parentheses, e.g. pattern="^.*?(text to keep).*".
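
A minimal sketch of how such a pattern reduces a page to the captured text, built on the same substitution shown further down under Security Considerations. The helper name apply_pattern and the sample page are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the text captured by the first group.
# The pattern must match the whole text from start to end, so that the
# substitution replaces everything with the captured part.
sub apply_pattern {
    my ( $text, $pattern ) = @_;
    return $text unless defined $pattern && length $pattern;

    # /s lets "." match newlines; /i makes the match case-insensitive.
    $text =~ s/$pattern/$1/is;
    return $text;
}

my $page = "<html><body>\nTemp: 72F today\n</body></html>\n";
print apply_pattern( $page, '^.*?([0-9]+F).*' ), "\n";
```

If the pattern does not match at all, the text comes back unchanged, which is why a pattern that fails to cover the whole page tends to leave stray content behind.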

Usage Examples

1. Display regression test results in a TWiki page

  <pre>
  %INCLUDE{"http://domain/~user/REDTest.log.txt"}%
  </pre>

2. Display AltaVista's robots.txt file

  • You type:
    • %INCLUDE{"http://www.altavista.com/robots.txt"}%
  • You get:
%INCLUDE{"http://www.altavista.com/robots.txt"}%

3. Display the SUNW stock quote in a TWiki page

  • You type:
    • SUNW: %INCLUDE{"http://finance.yahoo.com/q?s=SUNW&d=v1&o=t" pattern="^.*?>SUNW</a>[^<]+(.*?)\s+\S+\s+<small.*"}%
  • You get:
    • SUNW:

Failed to include URL http://finance.yahoo.com/q?s=SUNW&d=v1&o=t Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)

4. Temperature in San Jose

  • You type:
    • San Jose: %INCLUDE{"http://weather.yahoo.com/forecast/San_Jose_CA_US_f.html" pattern="^.*?([0-9]+\&ordm\;F).*"}%
  • You get:
    • San Jose:

Failed to include URL http://weather.yahoo.com/forecast/San_Jose_CA_US_f.html Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)

To Do

Questions

The pattern parameter is powerful, but requires regular expression knowledge. Is the syntax too complicated? Would it be better to offer a non-regexp search?

  • Yes, the syntax is too complicated (and essentially write-only). Side note: Somebody should really host a WWW service that "disassembles" regexps for viewing by a non-expert, and TWiki should offer a link there on the edit pages (or any other page where a regexp could be entered or displayed). But it is well established in a lot of TWiki communities, so it should remain available.
    It might be a good idea to offer search patterns in the style of Alta Vista, or to offer search facilities similar to those of Bugzilla.
    [ JoachimDurchholz - 25 Oct 2001 ]

Should the pattern syntax be made more powerful, i.e. split up into a search pattern parameter and a replace pattern parameter to allow reformatting of the included page? Example: searchpattern="^.*?abc\s+(\S+)\s+(\S+).*$" replacepattern="Price: $1, change: $2"

  • Yes, I think it should adhere to the usual conventions. Anybody who knows and uses regexps should be able to take advantage of them to their full extent.
    Whether using it in the way shown above is a good idea depends on how frequent and how massive the changes in the text being searched are. IOW adopting XML might already do everything for us that search-and-replace patterns can do (but I agree that XML would be a major change, and I've got absolutely no idea how to make the users separate text and formatting).
    [ JoachimDurchholz - 25 Oct 2001 ]
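
A rough sketch of how the proposed searchpattern/replacepattern pair might behave. Nothing here exists in TWiki; the function name search_and_format is hypothetical. It assumes the search pattern contains at least one capture group and, like the existing pattern parameter, matches the whole text, so the formatted string replaces the entire page.

```perl
use strict;
use warnings;

# Hypothetical helper for the proposed searchpattern/replacepattern pair.
sub search_and_format {
    my ( $text, $search, $replace ) = @_;

    # Collect the captured groups; an empty list means "no match".
    my @groups = ( $text =~ /$search/is );
    return $text unless @groups;

    # Fill in $1, $2, ... by hand instead of evaluating the template as
    # Perl code, so the replace pattern cannot run arbitrary expressions.
    ( my $out = $replace ) =~
        s/\$(\d+)/ $1 >= 1 && defined $groups[ $1 - 1 ] ? $groups[ $1 - 1 ] : '' /ge;
    return $out;
}

print search_and_format(
    "xyz abc 12.50 +0.25 more text",
    '^.*?abc\s+(\S+)\s+(\S+).*$',
    'Price: $1, change: $2'
), "\n";
```

Substituting the group references manually is a deliberate choice: the alternative, evaluating the replace pattern with eval or s///ee, would let a topic author execute Perl on the server.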

Security Considerations

  • Are there any security risks with the user-definable pattern match? I escaped the dangerous characters $@%&#'`/. Is that enough?

    if( $thePattern ) {
        $thePattern =~ s/([^\\])([\$\@\%\&\#\'\`\/])/$1\\$2/go;
        $thePattern =~ /(.*)/;     # untaint
        $thePattern = $1;
        $text =~ s/$thePattern/$1/is;
    }

If anybody knows, I am interested to learn if there is any security risk.

  • Perl style note: The explicit untainting in line 3 is unnecessary, as $thePattern will already be untainted in line 2.
    [ JoachimDurchholz - 25 Oct 2001 ]
  • I don't understand what line 4 does - I think it will overwrite the freshly untainted $thePattern - is this correct?
    [ JoachimDurchholz - 25 Oct 2001 ]
  • A non-Perl question: I don't see any security risks at work here, other than the possible inadvertent inclusion of malicious scripts. It should be possible to exclude scripting depending on URL prefix - we don't want to include script code from a site that might have been compromised by a hacker, do we?
    [ JoachimDurchholz - 25 Oct 2001 ]
  • Why do you think that $@%&#'`/ are dangerous characters?
    [ JoachimDurchholz - 25 Oct 2001 ]
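
One way to approach the question is to let Perl itself vet the pattern instead of escaping individual characters. The sketch below is not the TWiki code; compile_user_pattern and the 200-character limit are assumptions for illustration. It relies on two documented Perl behaviours: a pattern with a syntax error dies at compile time, and a pattern interpolated at run time may not contain (?{ ... }) code blocks unless "use re 'eval'" is in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern defensively.
# Returns a compiled regex, or undef if the pattern is rejected.
sub compile_user_pattern {
    my ($pattern) = @_;
    return undef unless defined $pattern && length $pattern;

    # Arbitrary length cap (an assumption) to limit pathological patterns.
    return undef if length($pattern) > 200;

    # qr// with an interpolated variable refuses (?{ ... }) code blocks
    # unless "use re 'eval'" is in effect, and dies on syntax errors;
    # eval traps both cases.
    my $re = eval { qr/$pattern/is };
    return $re;    # undef if compilation failed
}
```

This does not address patterns that are valid but extremely slow to match (catastrophic backtracking); a time limit around the match would be needed for that.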

Feedback

Please provide feedback here or in above text as [ name ] - signed bullets.

-- PeterThoeny - 21 Jun 2001

It's a great idea - I started coding the same a while back but stopped because I had to get on with my job :-)

Comments:

  • I would need it to be able to call out via a proxy. The LWP library has a very comprehensive http-get functionality, perhaps we could use that. (but see CpanPerlModulesRequirement)
  • Can things like this not exist as "internal plugins" rather than being "in the core"? I mean by this that they would be distributed with TWiki and are normally enabled yet use the plugin architecture for reasons of modularity. Please forgive me if I don't know what I am talking about as I have not looked at versions beyond the December 2000 release!
  • See AreVariablesReallyDirectives

-- MartinCleaver - 22 Jun 2001

I think this could be useful, and Martin makes a very good point about proxies. I don't see why %INCLUDE% shouldn't be extended to have this functionality, just by checking for http: at the start of the parameter. If this goes in a plugin, then it would be good to show that a variable can be enhanced by a plugin (good point from Martin under AreVariablesReallyDirectives). If a site uses an authenticating proxy/Intranet, it might be necessary to disable this feature.

On proxies - we could add a new variable to TWiki.cfg. I think for most proxies it's just a case of sending =GET http://...= to the proxy rather than the actual site. No doubt it's more complicated for authenticating proxies.

-- JohnTalintyre - 22 Jun 2001
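
The difference between a direct request and a proxied one, as described above, can be sketched as plain request text. The helper build_get_request is hypothetical and only builds the string; opening the socket (to the target host, or to the proxy when one is configured) is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: build the HTTP/1.0 request text for a URL.
# With a proxy, the request line carries the full URL and the socket
# would be opened to the proxy instead of the target host.
sub build_get_request {
    my ( $host, $port, $path, $use_proxy ) = @_;

    my $target = $use_proxy ? "http://$host:$port$path" : $path;
    return "GET $target HTTP/1.0\r\n"
         . "Host: $host:$port\r\n"
         . "\r\n";
}

print build_get_request( 'somewhere', 80, '/index.html', 1 );
```

An authenticating proxy would additionally expect a Proxy-Authorization header, which this sketch does not produce.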

Good point, let's use the existing %INCLUDE% variable for simplicity. For now let's take it into the core without proxy server support.

The first version is done and committed to TWikiAlphaRelease. TWiki.org is updated as well. The pattern parameter now also works for normal topic includes.

-- PeterThoeny - 23 Jun 2001

Isn't there a fair chance that including extracted HTML will break the page, e.g. by including a table start but not its end? I guess the content could be surrounded by an iframe to help avoid this; would this mean that the header should be kept? Or perhaps iframe inclusion could be an argument to the variable.

-- JohnTalintyre - 23 Jun 2001

Yes, there is a chance of broken HTML content when you use the pattern parameter. Since this is a power feature casual users won't use, it is up to the power user to make sure that the regular expression does the right thing.

-- PeterThoeny - 23 Jun 2001

On the grounds of eat your own dog food, I fed our internal wiki to itself. It complained...

Error: Unsupported content type: text/html; charset=ISO-8859-1. (Must be text/html or text/plain)

I suggest a small modification to avoid this kind of syntax error:

    if( $contentType =~ /text\/html/ ) {
        $text =~ s/^.*?<\/head>//ois;            # remove all HEAD
        $text =~ s/<script.*?<\/script>//gois;   # remove all SCRIPTs
        $text =~ s/^.*?<body[^>]*>//ois;         # remove all to <BODY>
        $text =~ s/<\/body>.*//ois;              # remove </BODY> to end
        $text =~ s/<\/html>.*//ois;              # remove </HTML> to end

    } elsif( $contentType =~ /text\/plain/ ) {
        # do nothing

I am still getting many sites that refuse to render - and would prefer to be able to turn off the content type checking in TWiki.cfg. For now, I have patched Net.pm with:

    #if( $contentType eq "text/html" ) {
    if( $text =~ /<html/i ) {

In terms of improvements to this, I would like to suggest a way of confining included html to frames. That's not to say that the current include function is not useful, but to recognise that there is also a benefit to being able to serve up a particular page whole from another site from within TWiki.

-- SteveRoe - 28 Jun 2001

Having played around with IE, I realize that this can be achieved by just embedding an HTML IFRAME in a topic:

<IFRAME NAME=content width=800 height=1200 SRC=$item></IFRAME>

I am now trying to wrap this into a TWiki var in the PeerRatingSystem I am implementing as a plug in. The notion is that you can review either an internal wiki page, or any page on the web as a recommendation to your colleagues. I would like to be able to include the page to be reviewed for the convenience of reviewers.

[If it goes into a plugin, I will make external review optional because IFRAMES are not supported by NS]

-- SteveRoe - 29 Jun 2001
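
Wrapping the IFRAME approach into a variable handler might look roughly like this. The function render_iframe is hypothetical; the default width and height are taken from the example above, and the escaping step is an addition so that a URL containing quotes or ampersands cannot break out of the attribute.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap an external page in an IFRAME instead of
# inlining its HTML. Width and height defaults follow the example above.
sub render_iframe {
    my ( $url, $width, $height ) = @_;
    $width  = 800  unless defined $width;
    $height = 1200 unless defined $height;

    # Escape characters that would break out of the quoted attribute.
    $url =~ s/&/&amp;/g;
    $url =~ s/"/&quot;/g;
    $url =~ s/</&lt;/g;
    $url =~ s/>/&gt;/g;

    return qq{<IFRAME NAME="content" WIDTH="$width" HEIGHT="$height" SRC="$url"></IFRAME>};
}

print render_iframe('http://www.altavista.com/robots.txt'), "\n";
```

The width and height arguments are inserted as given, so a real handler would also want to check that they are numeric.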

Made some fixes for better HTML page rendering:

  • Allow any type of text after content type value:
    if( $contentType =~ /^text\/html/ ) {
  • Fix incomplete URLs in href="", src="", action="".
  • Join lines of HTML tags that span multiple lines (to prevent TWiki rendering from escaping the angle brackets)

-- PeterThoeny - 01 Jul 2001
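
Two of the fixes listed above can be sketched as small helpers. These are illustrations under assumed names (media_type, join_multiline_tags), not the committed code. The first strips parameters such as the charset before the content type is compared; the second is a simplification that treats every <...> run as a tag and does not handle a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a Content-Type header value to its media
# type, dropping parameters such as "; charset=ISO-8859-1".
sub media_type {
    my ($content_type) = @_;
    my ($type) = split /;/, ( defined $content_type ? $content_type : '' );
    $type = '' unless defined $type;
    $type =~ s/^\s+|\s+$//g;
    return lc $type;
}

# Hypothetical helper: join HTML tags that span several lines, so each
# tag sits on one line before the wiki renderer sees it.
sub join_multiline_tags {
    my ($text) = @_;

    # Inside each <...> run, turn line breaks (plus surrounding
    # whitespace) into a single space.
    $text =~ s/(<[^>]*>)/ my $tag = $1; $tag =~ s{\s*\n\s*}{ }g; $tag /ge;
    return $text;
}
```

With media_type, the check can stay an exact comparison against "text/html" or "text/plain" while still accepting headers like the one in the error message reported earlier.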

Desired enhancement

It would be very nice if we could use the InterwikiPluginEarlyDev syntax inside INCLUDE.

  • [ PeterThoeny - 24 Jul 2001 ] This could and should be done in the plugin itself.

Security Issue

Is there a way to avoid infinite recursion? What if I write

  • %INCLUDE{"http://twiki.org/cgi-bin/view/Codev/IncludeUrlVariable"}%

  • [ PeterThoeny - 24 Jul 2001 ] This currently times out. Protecting against this is an appropriate thing to do.

-- AndreaSterbini - 24 Jul 2001
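
A simple first line of defence would be to refuse URL includes that point back at the wiki's own scripts. The sketch below assumes the site's host name and script path prefix are available from configuration; is_self_include and its parameters are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: refuse URL includes that point back at this wiki.
# $own_host and $script_prefix are assumed to come from site config.
sub is_self_include {
    my ( $url, $own_host, $script_prefix ) = @_;

    # Only http URLs with a host, optional port and optional path count.
    return 0 unless $url =~ m{^http://([^/:]+)(?::\d+)?(/.*)?$}i;
    my $host = lc($1);
    my $path = defined $2 ? $2 : '/';

    # Same host and a path under the wiki's script directory.
    return ( $host eq lc($own_host)
          && index( $path, $script_prefix ) == 0 ) ? 1 : 0;
}
```

This only catches the direct case. A loop through a second wiki, or through an alias for the same host, would still need a timeout or a hop counter.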

The include URL handling fixes incomplete links found in the included page. Until now it was only aware of href=/path and href="/path". It now also supports single-quoted references like href='/path'. This is in TWikiAlphaRelease.

-- PeterThoeny - 12 Sep 2003
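
The link fixing described here could be approximated with a single substitution that covers the unquoted, double-quoted and single-quoted forms. This is a sketch, not the code in TWikiAlphaRelease; fix_relative_links is a made-up name, and only links that start with a single "/" are handled.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix host-relative links with the site base,
# covering the unquoted, double-quoted and single-quoted forms.
sub fix_relative_links {
    my ( $text, $base ) = @_;    # $base like "http://host:80", no trailing slash

    # Match href/src/action, an optional opening quote, then a single "/".
    # The (?!/) lookahead leaves "//host/..." style links alone.
    $text =~ s{\b(href|src|action)\s*=\s*(["']?)/(?!/)}{$1=$2$base/}gi;
    return $text;
}

print fix_relative_links( q{<a href='/path'>link</a>}, 'http://somewhere:80' ), "\n";
```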

Topic revision: r17 - 2003-09-12 - PeterThoeny
 
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.