Help please!
I'm trying to build a plugin for custom TextFormattingRules and I'm having some difficulties. There doesn't seem to be a topic yet for these kinds of questions, so I started this one. Think of it as a CoffeeBreak for wannabe plugin authors.
Skipping HTML tags
I want to run my custom rendering rules on %TEXT%. I don't care about the page header or footers, I'm saving those for another project. How do I skip over the HTML-style tags? e.g. my custom rendering rules would apply to the highlighted text but ignore anything between matching < >'s.
<div>
some stuff is <span class="subdued">
"unimportant"</span></div>
This is important because I'm replacing quotes and apostrophes with their typographically nicer representatives - “like this” as opposed to "like that".
thanks much.
--
MattWilkie - 31 Jan 2004, 02 Feb 2004
The short answer is that using bare regexes to parse HTML is fraught with
problems, much more so than you would normally expect. Don't do
it; use a proper parser to do it for you (e.g. one of the CPAN
modules: step through the tree, perform replacements, etc.).
(Parsing HTML correctly using tools like lex & yacc can be just as
awkward, for that matter.)
That doesn't help you if you don't want to do things that way.
What you're after with regard to your example is not to parse
the HTML, but only to deal with what's outside the tags. On the
surface of it a tag matches the form
<[^>]*?>
And a regex for stuff outside would then be:
[^<>]*
Which then makes an HTML page essentially something like:
((<[^>]*?>)*([^<>]*))*
Problem is HTML tags can include stuff like:
<font size="+0" title="Hello <World"> Hello World </font>
Which renders for you as: >>>
<<< (Move your mouse there - it is there)
You might say it's malformed, but then the majority of
HTML on the web is malformed.
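To make the failure concrete, here is a minimal sketch of how the two patterns above behave on such input. The second input string (a > inside an attribute) is a made-up example, not from this page, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Naive tag pattern from the discussion above.
my $tag = qr/<[^>]*?>/;

# The example from this page: a '<' inside an attribute value.
my $with_lt = '<font size="+0" title="Hello <World"> Hello World </font>';

# Hypothetical input (an assumption for illustration): a '>' inside an attribute value.
my $with_gt = '<a title="a > b">link</a>';

# With '>' in an attribute, the tag pattern stops at the first '>' it sees.
my ($short_tag) = $with_gt =~ /($tag)/;
print "tag matched: $short_tag\n";

# With '<' in an attribute, "text followed by a tag" matches inside the real tag.
my ($wrong_outside) = $with_lt =~ /([^<>]+)(?=$tag)/;
print "treated as outside text: $wrong_outside\n";
```

In the first case the tag is cut short, so the rest of the attribute is later treated as page text. In the second, part of the tag itself is handed to the "outside" handler.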
The best you can probably hope for is something close - match
any number of consecutive HTML tags followed by something that
isn't a tag, and process that with a global eval replacement.
(Whilst assuming your HTML doesn't have < or > inside
attributes - if it does, you're out of luck)
Something like:
$text =~ s/([^<>]+)(?=<[^>]*?>)/&handleOutsideHTMLTags($1)/ge;
$text =~ s/([^<>]+)$/&handleOutsideHTMLTags($1)/e;
Line 1 handles all cases of (NonHTML HTML). The second handles
all cases of (NonHTML end-of-string)
(The /o modifier isn't sensible since the regex being built doesn't
change)
This is untested, but the approach suggested essentially notes:
- A page that might contain HTML consists of two types of text - sequences of NonHTML followed by HTML, or NonHTML followed by end of string. You're not interested in the situation where you have the string starting with HTML.
- The regex for HTML is essentially < followed by a minimal number of non > chars, followed by a > char, with the noted invalid assumption that HTML tags don't contain < or > in tag attributes. (They shouldn't, but that's not the same as whether they do or don't.)
- The regex for non-HTML says that it's a sequence of chars that are neither < nor >.
It then calls a handleOutsideHTMLTags function, which should return
the replacement. (Your replacement of quotes can happen inside that
function.)
It's worth noting that this approach is pretty simple to break, and
that's part of the reason for not parsing HTML using regexes.
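As a rough, hedged sketch of the approach described above: the body of handleOutsideHTMLTags is an assumption (it only does the curly-quote replacement from the original question), and the tag is checked with a lookahead so it is left in place rather than consumed. It still assumes no < or > inside attributes.

```perl
use strict;
use warnings;

# Assumed handler body: only the curly-quote replacement from the question.
sub handleOutsideHTMLTags {
    my ($text) = @_;
    $text =~ s/"(.*?)"/&#8220;$1&#8221;/g;
    return $text;
}

my $text = '<div>some "stuff" is <span class="subdued">"unimportant"</span></div> and "more"';

# Non-HTML followed by a tag (the tag is only looked at, not replaced).
$text =~ s/([^<>]+)(?=<[^>]*?>)/handleOutsideHTMLTags($1)/ge;

# Non-HTML followed by end of string.
$text =~ s/([^<>]+)$/handleOutsideHTMLTags($1)/e;

print "$text\n";
```

The quotes around the attribute value class="subdued" are left alone because text inside a tag is followed by > rather than by the start of a tag, so neither substitution matches it.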
-- MS - 03 Feb 2004
Thanks for the response MS, especially for including some example regexes I can play with (btw, the RegexCoach is awesome for someone like me). I'm heartened to learn that it's not because I'm stupid that I haven't been able to get very far with this. On the other hand I wish it was just me being foolish and overlooking the obvious - because then it would be easier to solve!
-- MattWilkie - 05 Feb 2004
Okay, what's wrong with this code? The substitutions are being done, html tags are being skipped, but the page as rendered by twiki doesn't reflect this. EmptyPlugin.pm:
sub endRenderingHandler
{
    foreach ( split (/(<.*?>)/s, $_[0], -1) ) {
        #print "\n---split: $_ ";
        # only process stuff which isn't an html tag
        # (assumes any split-line starting with < is a tag)
        if ( $_ =~ /^[^<]/ ) {
            $_ =~ s/\\\\\n/<br>/g;                   # manual line break
            $_ =~ s/ --- /&mdash;/g;                 # long dash
            $_ =~ s/ -- /&ndash;/g;                  # short dash
            $_ =~ s/\.\.\./&hellip;/g;               # trailing ellipsis...
            $_ =~ s/"(.*?)"/&#8220;$1&#8221;/g;      # curly quotes
            # Why doesn't the rendered page show the substitutions?
            # the debugs prove the work is being done:
            print "\n---subst: $_ ";
            TWiki::Func::writeDebug( "---subst: \t $_");
        }
    }
}
thanks in advance,
-- MattWilkie - 07 Feb 2004
The reason is this part: (paraphrased and line numbers added for clarity)
1: sub endRenderingHandler {
2:   foreach ( split (/(<.*?>)/s, $_[0], -1) ) {
3:     # only change $_
4:   }
5: }
Stepping through the logic of what happens here.
- In TWiki.pm, a call is made to endRenderingHandler via the plugins subsystem:
  &TWiki::Plugins::endRenderingHandler( $result );
- That function calls applyHandlers to loop through all the plugins.
- Key point: $result is handled by reference - if you change $_[0] in the plugin, you change the value of $result in the main codeline.
Note also that the original call, &TWiki::Plugins::endRenderingHandler( $result );
requires the callee to change the value of the variable referenced.
Stepping through your code:
- Your function is called - $_[0] is set to be a reference to $result
- You take $_[0] and split it into lots of strings.
- This creates an array of strings (there are probably optimisations internally, but that's the logic)
- The strings in this array are copies of the contents of $_[0], not references to the contents of $_[0]
- That means when you do your substitutions on $_ you are changing copies, not changing $_[0]. Since $_ doesn't reference anything that is saved, the work done is thrown away at the end of each pass through the loop body.
After changing copies of all the sections in $_[0] (and
throwing the results away) the loop exits and the function exits,
leaving the value of $_[0] unchanged.
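The difference between changing copies and changing the caller's variable can be seen in isolation. This is a small standalone sketch with made-up function names (changes_copies, changes_caller) and no TWiki code involved:

```perl
use strict;
use warnings;

# Modifies only the temporary strings produced by split; the caller sees nothing.
sub changes_copies {
    foreach ( split /,/, $_[0] ) {
        $_ = uc $_;
    }
}

# Assigns back to $_[0], which is an alias for the caller's variable.
sub changes_caller {
    $_[0] = join ",", map { uc $_ } split /,/, $_[0];
}

my $first = "one,two";
changes_copies($first);
print "after changes_copies: $first\n";

my $second = "one,two";
changes_caller($second);
print "after changes_caller: $second\n";
```

Only the second function leaves a visible change behind, because it writes its result back into $_[0].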
An approach that will work (change %PATT%, %SKIP% and the logic to fit):
sub endRenderingHandler {
    # my ($result,$snippet) =("",""); # mhw: hangs for some reason
    my $result = "";
    foreach my $snippet ( split (/%PATT%/s, $_[0], -1) ) {
        if ($snippet =~ /%SKIP%/) {
            $snippet = &transform($snippet);
        }
        $result .= $snippet;
    }
    $_[0] = $result; # return result
}
An alternative, which might be quicker depending on how fast string joins are (they're pretty fast in Perl, though):
sub endRenderingHandler {
    my @result = ("");
    my $snippet;
    foreach $snippet ( split (/%PATT%/s, $_[0], -1) ) {
        if ($snippet =~ /%SKIP%/) {
            $snippet = &transform($snippet);
        }
        push(@result,$snippet);
    }
    $_[0] = join "", @result; # return result
}
(None of the above are designed to run straight away, just explain what's wrong!)
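For anyone who wants something to experiment with outside TWiki, here is one possible way of filling in the placeholders. The choices are assumptions, not the only option: %PATT% becomes a capturing tag pattern, %SKIP% becomes "does not start with <", and transform does just two of the substitutions from the question.

```perl
use strict;
use warnings;

# Assumed transform: long dash and curly quotes only.
sub transform {
    my ($text) = @_;
    $text =~ s/ --- /&mdash;/g;
    $text =~ s/"(.*?)"/&#8220;$1&#8221;/g;
    return $text;
}

sub endRenderingHandler {
    my $result = "";
    # The capturing group makes split return the tags as well as the text between them.
    foreach my $snippet ( split /(<[^>]*>)/, $_[0], -1 ) {
        # Anything that does not start with < is treated as page text.
        $snippet = transform($snippet) if $snippet !~ /^</;
        $result .= $snippet;
    }
    $_[0] = $result;    # write the result back to the caller's variable
}

my $page = '<p class="x">say "hi" --- ok</p>';
endRenderingHandler($page);
print "$page\n";
```

This keeps the same limitation as before: a < or > inside an attribute value will confuse the split.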
Hope that helps.
-- MS - 07 Feb 2004
thank you very much for your help! With some tweaking I have been able to get method 1 to work. M2 has some syntax errors which I fixed (unmatched braces, which I'll fix here when I get home and have the code in front of me) but still gives me empty results.
In Method One I had to change:
my ($result,$snippet) =("","");
to
my $result = "";
otherwise the page would just sit there 'loading' forever. {shrug} I'm happy, I'm finally back to what I was really trying to do a week ago.
thanks again for your help.
-- MattWilkie - 11,12 Feb 2004
Great! Pleased to be of help. Seems a bit OTT to me for the
desired effect, but it's been useful to me too - supporting this
kind of transform in a sensible way would be cool.
Have fun,
-- MS - 12 Feb 2004
Skipping Rendering Stages
I see by the StepByStepRenderingOrder that rendering plugins will get called at least 3 times (%text%, %metadata%, %template%). Is there any way to flag steps as "clean"? e.g. No need to call me here, I don't do metadata or templates, only text?
-- MattWilkie - 02 Feb 2004
Random Notes
Added a couple of headings since the Q's answers would be distinct. Hope that's OK.
of course it is. increasing clarity is always appreciated. -MW
-- MS - 03 Feb 2004