The more I analyse performance, the more I find TWiki::Render::getRenderedVersion to be a major critical section of TWiki code and a huge performance hog. Specifically, the inner line loop regularly accounts for 50% of the runtime on large topics.

The question is, how can we improve on this? I've tried various ways, but I can't fix it without changing the semantics of TWiki rendering in subtle ways. I think those semantics have to change to get more performance, so I thought we should discuss the impact of this.

My proposal is to change this method to eliminate the line loop and have it operate in phases instead. In what follows, by "lift out" I mean an extraction mechanism similar to the current verbatim handling (sketched in code after the list). The pseudo-code would be:

  1. Lift out everything that should not be touched: <verbatim> blocks, HTML, and comments
    1. Lift out <script> sections
      1. Lift out <pre> sections
        1. handler
        2. Process block level TML (lists and tables, <div> level formatting)
      2. Re-insert <pre> sections
      3. Process inline TML (<span> level formatting)
      4. handler
      5. Lift out <noautolink> sections
        1. Apply wikiword substitutions
      6. Replace <noautolink> sections
    2. Replace <script> sections
    3. Expand variables
  2. Replace <verbatim> and comment sections

This way those incredibly expensive substitutions are each applied once, instead of repeatedly over every line in the topic.
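
For concreteness, here is a minimal sketch of the "lift out" mechanism, modelled on the current verbatim handling; the names liftOut/putBack and the \007-delimited placeholders are illustrative, not actual TWiki code:

    my @removed;    # lifted regions, indexed by placeholder number

    # Replace each <tag>...</tag> region with an opaque placeholder,
    # so that the global substitutions cannot touch its contents.
    sub liftOut {
        my ( $text, $tag ) = @_;
        $text =~ s{(<$tag\b[^>]*>.*?</$tag>)}{
            push @removed, $1;
            "\007LIFTED" . $#removed . "\007";
        }geis;
        return $text;
    }

    # Re-insert the lifted regions once all rendering is done.
    sub putBack {
        my ($text) = @_;
        $text =~ s{\007LIFTED(\d+)\007}{$removed[$1]}g;
        return $text;
    }

Because nothing in the rendering passes matches the placeholder, a lifted region survives every later substitution untouched.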

Note that one immediately obvious implication would be the loss of the line-by-line 'outsidePre' and 'insidePre' plugin handlers. These could instead be replaced with handlers at the points where I wrote "handler" above.

Your concerns, comments, observations please (especially you, Peter!)

-- CrawfordCurrie - 19 Sep 2004

In the same way as Perl had to change subtly to be compilable, we should use real ComputerScience methods to make TML deterministic.

Peter - what do you think?

-- SvenDowideit - 20 Sep 2004

I have always wondered (and asked somewhere else before) why the line loop is needed. After taking some courage and looking into the code, I found that the main reason is to render lists and tables in an easy way, so if we change that loop we need to find an alternative way to render them.

-- RafaelAlvarez - 20 Sep 2004

One way would be to switch to a line loop approach only if there are tables or lists to process, but otherwise use global substitutions for as much as possible.
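
Something like this, say (renderLineByLine and renderGlobally are hypothetical helpers, and the pattern is only a rough approximation of TML list/table markup):

    # Fall back to the per-line loop only when the topic actually
    # contains list or table markup; otherwise use global subs alone.
    if ( $text =~ m/^(?:\t|   )+(?:\*|\d+\.?)\s/m    # list items
      || $text =~ m/^\s*\|/m ) {                     # table rows
        $text = renderLineByLine($text);
    }
    else {
        $text = renderGlobally($text);
    }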

-- CrawfordCurrie - 26 Sep 2004

After implementing a wiki in Java, I "found" that there are two kinds of wiki tags:

  • Individual tags: those that affect only a single piece of text (think of the <span> HTML tag). Examples are the bold, italic and inline pre tags.
  • Block tags: those that affect a block of text (think of the <div> HTML tag). Examples are the list and table tags.

Individual tags can be rendered using a single pass over the text (well, one pass per tag), as in the sketch below. Block tags need special parsing (i.e., they may need line-by-line parsing).
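
For instance, a simplified single-pass rendering of TWiki bold; the real renderer's boundary rules are fussier than this regex:

    # *word or phrase*  ->  <strong>word or phrase</strong>, in one
    # global pass over the whole topic text.
    $text =~ s{\*([^\s*](?:.*?[^\s*])?)\*}{<strong>$1</strong>}g;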

If we can implement a mechanism to process block tags, we can remove the OUTSIDEPRE/INSIDEPRE loop completely. The rendering process will be (block processing is sketched after the list):

  • Take out verbatim
  • Process blocks
  • Process individual
  • Put back verbatim
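
As a hedged illustration of "process blocks" for a single block type, here is a sketch of a bullet-list pass, assuming TWiki's "multiples of three spaces, then *" list syntax (renderBulletLists is a made-up name, not proposed code):

    sub renderBulletLists {
        my ($text) = @_;
        my @out;
        my $depth = 0;    # current <ul> nesting level
        for my $line ( split /\n/, $text, -1 ) {
            if ( $line =~ /^((?:   )+)\* (.*)$/ ) {
                my $want = length($1) / 3;    # indent gives nesting
                while ( $depth < $want ) { push @out, '<ul>';  $depth++ }
                while ( $depth > $want ) { push @out, '</ul>'; $depth-- }
                push @out, "<li> $2 </li>";
            }
            else {
                while ( $depth > 0 ) { push @out, '</ul>'; $depth-- }
                push @out, $line;
            }
        }
        while ( $depth > 0 ) { push @out, '</ul>'; $depth-- }
        return join "\n", @out;
    }

This pass still looks at lines, but only this pass does; everything else works on the whole text.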

-- RafaelAlvarez - 27 Sep 2004

Hmmm. Doesn't that imply an even greater change to TML? Can you list what you think constitutes a "Block" versus an "Individual" in TML?

-- CrawfordCurrie - 28 Sep 2004

Do you mean these block elements: address, blockquote, br, center, del, div, h, hr, ins, li, ol, p, pre, table, td, th, tr, ul ?

-- ArthurClemens - 28 Sep 2004

No, Arthur. I mean block elements in the TwikiML. They are: lists (unordered, ordered, definition), tables, verbatim and noautolink. The "individual" tags are all the rest. A quick rule of thumb to classify them: "individual" tags can be rendered using a single regex, while "block" tags need some special treatment, and you have to keep track of the fact that you are rendering that kind of block.

-- RafaelAlvarez - 29 Sep 2004

OK, we're on the same wavelength; your description is just a bit neater than mine. I think what you want is doable by a step 6a: process lists and tables. Yes?

-- CrawfordCurrie - 29 Sep 2004

Yes, I think so. :-)

-- RafaelAlvarez - 29 Sep 2004

I reworked it a bit, to take care of some concerns.

Can anyone see what is broken in the pseudocode at the top?

-- CrawfordCurrie - 29 Sep 2004

Sven's comment is interesting.

TWiki and other Perl-based wikis I've looked at all seem to work on the principle of applying Perl's pattern matching capability to go "cher-clunk" on large quantities of stuff, and then doing another "cher-clunk" for another pattern on the same data.

And they do it even if there is nothing that matches. So you "cher-clunk" for bold or italic or underline or whatever even if it's not there.

For example, we get:

  • pull out the verbatim etc
  • go cher-clunk to transform the something-something-something to the HTML or span yamma-yamma for monospace ... even if it's not there ... then ...
  • go cher-clunk to transform the something-something-something to the HTML or span yamma-yamma for bold ... even if it's not there ...

... and so forth.

But I've always been impressed by parsers. I've written a lot of compilers, and let's face it, with tools like YACC/Bison it's not that hard. As Sven says, Perl had to face up to it.

The thing is, state machines are FAST!

Yes, I know that Perl's pattern matching is a state machine, and that we have Henry Spencer and others to thank for the regex code. But the Perl code is doing repeated set-up and cher-clunk.

If I had to give an analogy I'd go back to the early steam-driven harvesters. The steam engine sat at the edge of the field, the farmers set up a post at the other side, and the engine dragged the "cutters" across. Then they moved the whole machinery forward. Manually. Cher-chugga-chugga, step-and-repeat. Compare that to today: we have combine harvesters (or in my case the lawnmower) that just go through eating it up like a face in a Pac-Man game and spewing out the mulch, wheat, baling it ... whatever.

Deterministic FSA grammar parsers are like that. Do it properly and you don't back-track.

It also solves the "verbatim" problem. The lawnmower - err, parser - sees the verbatim open and just eats and spews unchanged until it sees the verbatim close. During that time it doesn't call any plugin handlers or do any other translations.
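
In state-machine terms that is just a two-state scanner; a toy Perl sketch, where render() stands in for whatever the normal per-line rendering is:

    my $inVerbatim = 0;
    my @out;
    for my $line (@lines) {
        if ($inVerbatim) {
            push @out, $line;    # eat and spew unchanged
            $inVerbatim = 0 if $line =~ m{</verbatim>}i;
        }
        elsif ( $line =~ m{<verbatim\b}i ) {
            $inVerbatim = 1;     # enter the "eating" state
            push @out, $line;
        }
        else {
            push @out, render($line);    # normal translations
        }
    }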

So what to do? Personally I like Currie's idea of experimental forks in PostCairoDevelopmentModel. But in reality it's not a big deal to experiment with a limited ML and YACC just off on the side, with a simple test harness, to do a proof of concept. Last month I had time; this month, with the Sarbanes-Oxley deadlines looming, I don't.

But Sven is right. "Computer Science" has a lot of tools in its bag. We can't always come up with something to "replace" chunks of TWiki straight off, but that doesn't mean we shouldn't experiment and present "proof of concept" and comparative tests.

-- AntonAylward - 30 Sep 2004

As a computer scientist I have to agree. As a pragmatist, I have to disagree. Parsers are fine when you have a nice parseable language, describable with an LALR grammar. TML isn't like that; it is far more elastic. I guess you could fully describe a grammar for TML, but the number of states would be enormous. Big state machines are notoriously difficult to control, and the sheer complexity would daunt any potential contributor.

The best you might achieve would be a Flex-like tokeniser with some general state information - hold on, that's what's already there, isn't it? Isn't the line loop exactly that? And exactly what we are trying to get rid of?

-- CrawfordCurrie - 30 Sep 2004

P.S. A challenge for the unwary: write a Bison grammar for TML. Oh, and don't forget that the plugin handlers may modify the parsed language at any point.....

-- CrawfordCurrie - 30 Sep 2004

"The plugins can modify the parsed language"? AFAIK, what the plugins can modify is the parsed text, but that should be before or after TWiki parses it. If a plugins add a new tag, well, that it's problem.

Overall, I strongly agree with Crawford. But if I get some spare time, I'll try to write that grammar (couldn't resist this kind of challenge :-) )

-- RafaelAlvarez - 30 Sep 2004

Pedant. OK, so the plugins can only change the input, not the grammar. But plugins can certainly add tags to the text that may require reparsing from the start. :-(

Plugins can't change the grammar.... hey, would that be neat? A plugin that changes the grammar in the middle of a parse.... ;-)

-- CrawfordCurrie - 30 Sep 2004

"A plugin that changes the grammar in the middle of a parse.... " - you've just invented LISP wink

Actually, a grammar doesn't have to be LALR to be parsable. It's just that computer science professors like to set exams they can mark. Richard Bornat, in his classic book on compiler writing, takes a departure from the Aho-Johnson-Ullman ideas and says things like:

  • "LR analysers maintian efficiency (i.e state table size, no backtracking) by forcing determinism..."

  • "Books on compiler writng ... concentrate on syntax analysis .. because the theoretical attention this subject has received makes it capable of highly formal presentation .. easy for academics to write examination questions about. (p 249)

  • ... backtracking's ... most important drawback is that it prevents accurate error reporting ...

Which we don't care about - we're eyeballing the errors! See below.

It's worth remembering the difference between HTML and XML: XML is a "proper" grammar from the POV of deterministic parsing. For every <operator> there must be a corresponding </operator>, whereas in HTML you can open a <bold>, forget to close it, have some more stuff, headings and so forth, and end up with the </body></html>. OK, so it will not look like you wanted, but that isn't the point. That happens now. I did it yesterday editing another topic. You see it and you fix it. It's a typo! I'm sure you have made similar mistakes too, or mistyped "vernatom" or something.

My point is that even with one-token lookahead we can get this done. And that's enough. Want an example?

The token analyser sees three dashes at the beginning of a line and does one lookahead. If it sees a "+" it pushes it back and hands BEGINTITLE back to the parser. The parser dispatches parseAndEmitTitle() and away it goes.
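
A toy version of that tokeniser step (BEGINTITLE and nextToken are made-up names; the lookahead is done here with a zero-width regex assertion, which leaves the "+" in the input, i.e. pushes it back):

    sub nextToken {
        my ($input) = @_;    # reference to the remaining text
        # Three dashes at the start of the input, with one token of
        # lookahead: consume them only if a '+' follows.
        if ( $$input =~ s/^---(?=\+)// ) {
            return 'BEGINTITLE';
        }
        # ... other token rules would go here ...
        return undef;
    }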

But the real boost is dealing with verbatim blocks .... oh, I love that idea!

"Plugins modify.." the text, yes. Push back and re-parse. Heck, its a standard technique with a Strachey-Brown macro processor. When I was studying under P J Brown at Canterbury back in the mid 70s we built ML-1. This was a parser/translator that was a macro processors and was fully recursive. FULLY. The language could modify itself, push back and re-parse. We built compilers with it! We did it then so don't tell me it can't be done now.

See http://members.shaw.ca/parz/ML1.htmlfor comments and source.

Oh, and before I forget, a fact of compiler/parser life: of course the line loop is expensive; all lexical analysis is. That's life.

-- AntonAylward - 30 Sep 2004

The point is:

  • Lexical analysis is slow
  • FSA-driven parsing can be faster than repeated application of substitutions
  • FSA-driven parsing solves some otherwise thorny problems like "verbatim", "<pre>" etc.
  • It's not easy with some tools, because TWiki-ML isn't a 'regular' well-formed language
  • Despite that, it can be done, because it has been done

-- AntonAylward - 01 Oct 2004

Of course you are right. However, the pragmatist in me keeps reminding me of that old Irish way of giving directions: "If I wanted to get there, I wouldn't start from here."

I think it would be very, very, very difficult to convert the TWiki rendering engine into a token-driven FSM and still retain anything close to the existing semantics. Don't let me discourage you from trying, however!

-- CrawfordCurrie - 01 Oct 2004

There's one thing that hasn't been brought up: if you look at HTML parsers, you'll notice that there are a lot of hacks to handle unclosed/misnested tags. Most will still give up if the <>'s don't match up. The thing about TWiki's approach is this: if there's no syntax, you can't possibly have a syntax error. No matter how bulletproof you try to make it, a parser can fail... and what do you do then?

-- WalterMundt - 01 Oct 2004

Please, let's drop the parser question. Talk is cheap, and it's a red herring unless someone pitches in with working code. I'd rather focus on whether my logic at the top of the page is right or not. After all, I'm probably the only schmuck who's actually going to do anything about implementing this.

-- CrawfordCurrie - 01 Oct 2004

Perhaps it would be good to have a handler between phases 1.3 and 2 (after expanding variables, and before putting back the verbatim). That way a plugin could post-process the generated HTML tags.

-- RafaelAlvarez - 01 Oct 2004

An alternative approach, used by Textile and RedCloth (and somewhere out there is 'BlueCloth'), is to read the input into an array of lines and process those.

You can still do the cher-clunk on the array as a whole, but this approach makes a few things easier. I played with it earlier in the year and found that I could do things like pull out verbatim blocks very easily. I could also take lines that had embedded <pre> and do a split and insert, making them look like verbatim blocks.

In fact, that "insert an array of lines into the array of lines" trick can be a useful tool.
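
For instance, splice makes the "split and insert" trick for an embedded <pre> almost a one-liner (a sketch; real <pre> handling has more cases):

    my @lines = split /\n/, $text, -1;
    for ( my $i = 0 ; $i < @lines ; $i++ ) {
        if ( $lines[$i] =~ m{^(.*?)<pre>(.*?)</pre>(.*)$}i ) {
            # Break the embedded <pre> onto lines of its own, so later
            # passes can treat it like a verbatim block.
            splice @lines, $i, 1, $1, '<pre>', $2, '</pre>', $3;
            $i += 4;    # skip past the lines we just inserted
        }
    }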

The syntax of Textile is a bit different from TWiki but that's not really an issue.

-- AntonAylward - 01 Oct 2004

The above rendering order is probably a workable solution if you start from scratch. For TWiki it will break existing content and TWikiApplications that expect the current rendering order.

-- PeterThoeny - 01 Nov 2004

Hmmm. I think what you just said was "no change to semantics is acceptable". If that is the case, we can write off any chance of performance improvement through optimising the rendering loop any further. Oh well.

-- CrawfordCurrie - 01 Nov 2004

I disagree with Peter totally here. I think that anything flaky enough to break due to an improvement in the determinism of the syntax should be forced to change.

The best we (the TWiki open source community) can offer is to think about run-once conversion tools to convert legacy applications. And this is dependent on getting enough information to understand the specifics of the problem (I'd suggest in unit test form).

This is the focus of the DevelopBranch, and as solutions come to maturity, they can be moved to MainBranch.

So please do not use legacy TWikiApplications to discourage innovation, or to discourage improving the determinism and rigour of TML.

-- SvenDowideit - 03 Nov 2004

Fixed reversed output from TWiki::Plugins::outsidePREHandler (SVN DEVELOP 1808) inside getRenderedVersion().

-- WillNorris - 04 Nov 2004

Will's fix wasn't quite right; fixed in 1813.

-- CrawfordCurrie - 04 Nov 2004

Marking this as done, because it's not really a feature and refactoring will be followed up in Bugs web.

-- CrawfordCurrie - 11 Jan 2006
