Over the years there have been many attempts to characterise TWiki performance, some more successful than others. Performance evaluation falls into two main areas:
- End user performance. This is the performance perceived by a TWiki user sitting at their browser, working with TWiki. It's best measured using ab (the apache benchmark tool), and is the performance that most people want to hear about when we discuss "performance" in public.
- More focused evaluation of code runtime. This is primarily of interest to developers, as it's the way we identify hotspots in the code so they can be optimised, and is the subject of this topic.
In the past we have made use of performance analysis tools such as CPAN:Devel::DProf and CPAN:Devel::SmallProf to try and analyse TWiki performance. These profilers suffer from a major problem: they rely on time-based sampling. This works well in an environment where a few functions absorb the bulk of the runtime, but isn't that helpful when a lot of smaller functions are run sequentially, as the sample rate can end up sampling a pseudo-random selection of functions. This results in different performance information for each run, even over the same data. The sampling is also highly sensitive to perl's memory management, which in turn depends on what the operating system is doing. The conclusion from my own experience is that these tools are useless for profiling TWiki performance, except in the rare case where something in the code is so bad that it stands out like a sore thumb. TWiki is simply too fast to be usefully profiled this way.
A more accurate measure is obtainable by instrumentation, where special statements are added to the code to record the runtime at different points. This is the basis of my methodology.
In brief, the approach is to use a class called Monitor, which is used like ASSERT - unless the TWIKI_MONITOR environment variable is defined, the functions of this package are nops and are optimised out of existence by the perl compiler. The Monitor class uses CPAN:Benchmark to capture runtime snapshots at various points, and outputs a report at the end of the run.
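Something like this is enough to get useful numbers - a minimal sketch of the idea, not the actual TWiki Monitor code (the report format here is invented):
<verbatim>
package Monitor;

use strict;
use warnings;
use Benchmark qw(:hireswallclock timediff timestr);

my @marks;

if ( $ENV{TWIKI_MONITOR} ) {
    # Record the event name and a Benchmark snapshot
    *MARK = sub {
        my $event = shift;
        push @marks, [ $event, Benchmark->new() ];
    };
}
else {
    # In the real Monitor the calls compile down to nothing when
    # TWIKI_MONITOR is unset; an empty sub stands in for that here.
    *MARK = sub { };
}

# Print a delta/absolute report, like the tables below, when the run ends
END {
    for my $i ( 1 .. $#marks ) {
        my $delta = timediff( $marks[$i][1], $marks[ $i - 1 ][1] );
        my $total = timediff( $marks[$i][1], $marks[0][1] );
        print STDERR '| ', $marks[$i][0], ' | ', timestr($delta),
          ' | ', timestr($total), " |\n";
    }
}

1;
</verbatim>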
I'm going to make a contentious statement.
As perl programs go, TWiki does not perform badly.
Let's take a look at the detail of the runtime for a particular topic; in this case, I'm interested in the compilation time, which has often been pilloried. To generate the trace below, I added a BEGIN block that is executed as each module completes compilation, so any line that says "XXX begun" is a reasonable reflection of the compilation time for that module. This is out-of-the-box TWiki with a typical configuration (Template login), and I'm rendering Main:
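The markers themselves are trivial; something like this at the end of each module source (a sketch - the exact call used in the real code may differ):
<verbatim>
package TWiki::Attrs;

# ... module body ...

# A BEGIN block at the very end of the file runs as soon as perl has
# finished compiling everything above it, so the delta since the
# previous mark approximates this module's compilation time.
# (Assumes Monitor has already been loaded by the calling script.)
BEGIN {
    Monitor::MARK('TWiki::Attrs begun');
}

1;
</verbatim>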
| Event | Delta | Absolute |
| TWiki::Sandbox begun | 0.006006 wall 0.01 CPU | 0.006006 wall 0.01 CPU |
| Static configuration loaded | 0.00957608 wall 0.01 CPU | 0.0155821 wall 0.02 CPU |
| TWiki::Access begun | 0.00133896 wall 0.00 CPU | 0.016921 wall 0.02 CPU |
| TWiki::Meta begun | 0.00286412 wall 0.00 CPU | 0.0197852 wall 0.02 CPU |
| TWiki::Attrs begun | 0.016588 wall 0.01 CPU | 0.0363731 wall 0.03 CPU |
| TWiki::Store begun | 0.018333 wall 0.02 CPU | 0.0547061 wall 0.05 CPU |
| TWiki::Prefs::Parser begun | 0.00115395 wall 0.00 CPU | 0.05586 wall 0.05 CPU |
| TWiki::Prefs::PrefsCache begun | 0.00152206 wall 0.00 CPU | 0.0573821 wall 0.05 CPU |
| TWiki::Prefs begun | 0.00163388 wall 0.00 CPU | 0.059016 wall 0.05 CPU |
| TWiki::Attach begun | 0.00482416 wall 0.01 CPU | 0.0638402 wall 0.06 CPU |
| TWiki::Form begun | 0.00489783 wall 0.00 CPU | 0.068738 wall 0.06 CPU |
| TWiki::Access begun | 0.00360608 wall 0.00 CPU | 0.0723441 wall 0.06 CPU |
| TWiki::Plugin begun | 0.00256109 wall 0.01 CPU | 0.0749052 wall 0.07 CPU |
| TWiki::Func begun | 0.00801897 wall 0.01 CPU | 0.0829241 wall 0.08 CPU |
| TWiki::Plugins begun | 0.00342083 wall 0.00 CPU | 0.086345 wall 0.08 CPU |
| TWiki::Render begun | 0.0166001 wall 0.02 CPU | 0.102945 wall 0.10 CPU |
| TWiki::Search begun | 0.00997305 wall 0.01 CPU | 0.112918 wall 0.11 CPU |
| TWiki::Templates begun | 0.00250697 wall 0.00 CPU | 0.115425 wall 0.11 CPU |
| TWiki::LoginManager begun | 0.00718594 wall 0.01 CPU | 0.122611 wall 0.12 CPU |
| TWiki::Users begun | 0.00268412 wall 0.00 CPU | 0.125295 wall 0.12 CPU |
| TWiki::I18N begun | 0.00232196 wall 0.00 CPU | 0.127617 wall 0.12 CPU |
| TWiki::UI begun | 0.026973 wall 0.03 CPU | 0.15459 wall 0.15 CPU |
| TWiki::UI:View begun | 0.00332785 wall 0.00 CPU | 0.157918 wall 0.15 CPU |
As we can see, each module takes 1 or 2 centiseconds to compile; modules that are physically bigger take longer. This "compile time" accounts for ~0.15 seconds, which we can take as a constant - it's time that has to be spent irrespective of the page being viewed.
The next job is creating the session, and that's a big job - isn't it?
| About to create session | 0.0080719 wall 0.01 CPU | 0.166477 wall 0.16 CPU |
| TWiki::Users::TWikiUserMapping beginning | 0.0219061 wall 0.02 CPU | 0.188383 wall 0.18 CPU |
| TWiki::Users::TWikiUserMapping begun | 0.00474095 wall 0.01 CPU | 0.193124 wall 0.19 CPU |
| Htpasswd begun | 0.001158 wall 0.00 CPU | 0.194282 wall 0.19 CPU |
| TWiki session created | 0.044009 wall 0.04 CPU | 0.238291 wall 0.23 CPU |
Big? I guess so - it took 0.07s. But that included reading all the TWiki preferences, so altogether I think that's not too shabby. So, as this is a view, access permissions need to be checked:
| Access checked | 0.000117064 wall 0.00 CPU | 0.238408 wall 0.23 CPU |
OK, now we can start the view proper. We have 0.23s of "constant time" under our belts that we can't do much about - haven't we? Well, yes, we have, but only because in the course of this exercise I changed a use File::Temp to a require File::Temp, saving 0.1s, and commented out use CGI::Carp qw(fatalsToBrowser), which saved another 0.05s. (So there are still proportionally significant gains to be found through simple analysis of the compilation process.)
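For reference, the File::Temp change is just deferred loading - a sketch (the helper sub name is invented):
<verbatim>
# Before: the module is compiled on every request, whether or not a
# temporary file is ever needed.
# use File::Temp;

sub _mkTempFile {
    # After: File::Temp is loaded (once) only on the code path that
    # really needs it, so requests that never get here skip the cost.
    require File::Temp;
    return File::Temp->new();
}
</verbatim>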
Now, this is Pattern skin, so there's a web left bar. When I run this same page with Classic skin, it renders in 0.27s i.e. the whole page rendering process takes 0.04s. But with Pattern skin, it takes:
| END | 0.110471 wall 0.11 CPU | 0.530478 wall 0.52 CPU |
Nearly twice as long. Why? Here's why:
| Getting weblist | 0.0866899 wall 0.09 CPU | 0.325098 wall 0.32 CPU |
| Got weblist | 0.094909 wall 0.09 CPU | 0.420007 wall 0.41 CPU |
When I switch off hierarchical webs, the total runtime drops back to 0.4s - so listing the hierarchical webs is taking ~0.12s. Note that I am using a leftbar with no twiki tips. When I include the twiki tips, the runtime increases by another 0.15s.
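(If you don't use sub-webs, hierarchical web support can be switched off in LocalSite.cfg - the setting name below is from memory, so check it against your configure screen:)
<verbatim>
# lib/LocalSite.cfg - turn off sub-web (hierarchical web) support if
# unused, so the weblist expansion no longer recurses into every web.
$TWiki::cfg{EnableHierarchicalWebs} = 0;
</verbatim>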
So, what does that mean for the end user? Well, we've established that the perl runtime for rendering this page is of the order of 0.5s. What does ab say?
Time per request: 630.933 [ms] (mean)
Same server, same load conditions, little or no network latency. Classic skin:
| END | 0.069102 wall 0.07 CPU | 0.317981 wall 0.31 CPU |
Time per request: 408.015 [ms] (mean)
OK, what lessons can we take from this?
- The above analysis provides a way to baseline the code - eliminate the compilation phase, so we can focus on the post-compilation execution, which is the most time-consuming part. It has already indicated simple changes that together achieve a (relatively) major improvement in static initialisation time.
- Static initialisation time can be improved, but only by reducing the amount of code loaded, by whatever method. Peter has always argued that "extra modules penalise runtime", and he is dead right. We should do this, but we cannot expect major performance improvements from it.
- The design of the skin can have a huge impact - much larger than optimising the code can ever deliver. What appear to be small productivity features in the skin can have major performance impacts.
- If you are not using a feature, turn it off.
My feeling after this exercise is that optimising the code will provide ever diminishing returns, and that far more would be gained by putting effort into (1) caching and (2) optimising the skins. That's not to say there aren't still some slow chunks of code after initialisation - I'm well aware of the performance issues with queries, and search generally is a bear - but I think there are many more accessible gains to be made by looking again at the skins. And just about anyone can optimise a skin - no perl experience required!
-- Contributors: CrawfordCurrie - 04 Jun 2007
Discussion
So the dynamic parts are eating seconds, no surprise. I've been using VarCachePlugin for our homepage. I wonder if that could be used selectively for left bar and top bar.
Pattern skin has a lot of icon decoration. Each icon creates a server up-to-date request, and that adds up. It would be better if we used one image file and CSS to position it (the "sliding doors" technique). Any ideas how we could make this work with no code change for the users? (global setting anyone...)
Crawford, did you test with locale? You didn't mention it, so I assume you did not. With a language selector and lots of MAKETEXT, performance will be worse. We do need to try to use mo files instead of po files.
-- ArthurClemens - 04 Jun 2007
No, I didn't. Maybe some time, if I have time.
-- CrawfordCurrie - 04 Jun 2007
Crawford. Excellent work.
Some details. ab is a very good tool for the perl part. But measure how long it takes from when the page is loaded: the time the skin takes to check that all the many files are up to date, run the javascripts and calculate the page rendering adds up to 0.5 seconds. So skin developers can do a lot to optimize. BUT skin CPU time does not depend that much on server load, because it is mainly spent in the client CPU. The perl code combined with many users doing many searches is what kills the server - and searching is naturally the worst bandit.
It seems the weblist could be a quick little win. How about having the list in a cache file, like it was implemented for the language list used in the top bar? That seems like a simple thing that could give a small but worthwhile win. Updating the cache could be done from the web manage page.
-- KennethLavrsen - 04 Jun 2007
About the web cache, I see a "small" problem: when should the cache be updated? If webs were created only through TWiki tools, it would be trivial to update the cache on each web manipulation. But things get complicated when you can create webs manually (cp -R _default MyNewWeb). Perhaps this could be solved by providing a perl script to update the cache from the command line.
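A sketch of what such a script might look like (paths and the cache file name are invented for illustration, and subwebs would need a recursive scan):
<verbatim>
#!/usr/bin/perl
# update_weblist_cache.pl - rebuild a cached list of webs from the data
# directory, so the skin can read one file instead of scanning every time.
use strict;
use warnings;

my $dataDir   = shift || '/var/www/twiki/data';    # assumed location
my $cacheFile = "$dataDir/.weblist_cache";         # invented name

opendir( my $dh, $dataDir ) or die "Can't read $dataDir: $!";
my @webs = sort grep {
    !/^[._]/                                  # skip _default, .svn, etc.
      && -d "$dataDir/$_"
      && -e "$dataDir/$_/WebPreferences.txt"  # only real webs
} readdir($dh);
closedir $dh;

open( my $fh, '>', $cacheFile ) or die "Can't write $cacheFile: $!";
print $fh map { "$_\n" } @webs;
close $fh;

print "Cached ", scalar(@webs), " webs to $cacheFile\n";
</verbatim>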
-- RafaelAlvarez - 04 Jun 2007
Sven tells me he once proposed having the list of webs in a configure variable. I wouldn't go quite that far, but there is definitely a case for having an "index file" in the root that has a list of webs. There is another reason for wanting this. At the moment webs have nowhere to hang meta-data - except by creating side files or topics in the webs themselves - which has been a PITA for a long time. A properly designed index file could provide that.
An index file would also provide a way (potentially) to support softlinks on Windows platforms, as it could record absolute paths to the web directories.
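Purely as a strawman (the file name and format are invented for illustration), such an index could be as simple as a perl hash serialised into a file in the data root:
<verbatim>
# Hypothetical data/.webindex - one place to hang per-web meta-data,
# including an absolute path so soft-linked webs could work on Windows.
%TWiki::webIndex = (
    'Main' => {
        path   => '/var/www/twiki/data/Main',
        parent => undef,
    },
    'Sandbox/Nested' => {
        path   => '/var/www/twiki/data/Sandbox/Nested',
        parent => 'Sandbox',
    },
);
1;
</verbatim>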
-- CrawfordCurrie - 05 Jun 2007
Another important factor of code analysis is how it scales: how is performance related to an increasing factor x, with x = number of topics, number of users, number of requests, whatever.
Typically, the gain of caching a web list/a topic list/backlinks/user mapping info/templates will increase with the number of webs/topics/users/whatever up to a certain point where cache maintenance kicks in and the ratio stagnates or even starts dropping again.
We need to explore these kinds of experiments to get to know our code.
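For example, a crude scaling harness could look something like this (a sketch - the populate/search subs are placeholders to be wired up to real TWiki calls):
<verbatim>
#!/usr/bin/perl
# Time one operation at increasing data sizes to get a rough scaling curve.
use strict;
use warnings;
use Benchmark qw(:hireswallclock timeit timestr);

# Placeholders - replace with real TWiki operations for a real experiment.
sub populate_test_web { my ($n) = @_; }    # create $n dummy topics
sub run_search        { }                  # e.g. a %SEARCH% over the web

for my $nTopics ( 100, 1_000, 10_000 ) {
    populate_test_web($nTopics);
    my $t = timeit( 10, \&run_search );
    printf "%6d topics: %s\n", $nTopics, timestr($t);
}
</verbatim>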
-- MichaelDaum - 05 Jun 2007
I added a Monitor::This(module, method) which allows us to see how long a particular method takes, or even all the methods in a module (if you leave method undefined). It's already helped me find one single performance hog in the usermapping code, and hopefully will allow us to do similar in other places.
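For anyone curious, the general technique for that kind of per-method timing is to wrap the original sub via the symbol table. A simplified sketch (not necessarily how Monitor::This itself is implemented; the sub name here is invented):
<verbatim>
package Monitor;

use strict;
use warnings;
use Benchmark qw(:hireswallclock timediff timestr);

# Wrap one method of a package so that every call is timed and reported.
sub wrapMethod {
    my ( $package, $method ) = @_;
    no strict 'refs';
    no warnings 'redefine';
    my $original = \&{"${package}::${method}"};
    *{"${package}::${method}"} = sub {
        my $start  = Benchmark->new();
        my @result = wantarray ? $original->(@_) : scalar $original->(@_);
        print STDERR "$package\::$method took ",
          timestr( timediff( Benchmark->new(), $start ) ), "\n";
        return wantarray ? @result : $result[0];
    };
}

1;
</verbatim>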
-- SvenDowideit - 13 Jun 2007
Michael - I think your statement, about cache maintenance overheads that cause cache efficiency to drop as numbers scale up, is only true if the algorithm for cache replacement (deciding which cache entry to drop when adding a new entry to a fixed-size cache) is inefficient, i.e. O(N) not O(1). There are many efficient O(1) cache replacement algorithms - see this search for some papers - but note that the popular LRU (Least Recently Used) is one of them, implementable as a linked list ordered by last access time.
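For the record, here is a bare-bones sketch of that classic O(1) structure - a hash for lookup plus a doubly-linked list kept in access order for reordering and eviction (illustrative only, not TWiki code; CPAN has production-quality implementations):
<verbatim>
package SimpleLRU;

use strict;
use warnings;

sub new {
    my ( $class, $maxSize ) = @_;
    return bless {
        max   => $maxSize,
        nodes => {},       # key => node, for O(1) lookup
        head  => undef,    # most recently used
        tail  => undef,    # least recently used
    }, $class;
}

# Detach a node from the list, fixing up head/tail as needed - O(1)
sub _unlink {
    my ( $self, $node ) = @_;
    if   ( $node->{prev} ) { $node->{prev}{next} = $node->{next} }
    else                   { $self->{head} = $node->{next} }
    if   ( $node->{next} ) { $node->{next}{prev} = $node->{prev} }
    else                   { $self->{tail} = $node->{prev} }
}

# Put a node at the head of the list (most recently used) - O(1)
sub _pushFront {
    my ( $self, $node ) = @_;
    $node->{prev} = undef;
    $node->{next} = $self->{head};
    $self->{head}{prev} = $node if $self->{head};
    $self->{head} = $node;
    $self->{tail} = $node unless $self->{tail};
}

sub get {
    my ( $self, $key ) = @_;
    my $node = $self->{nodes}{$key} or return undef;
    $self->_unlink($node);      # mark as most recently used
    $self->_pushFront($node);
    return $node->{value};
}

sub set {
    my ( $self, $key, $value ) = @_;
    if ( my $node = $self->{nodes}{$key} ) {
        $node->{value} = $value;
        $self->_unlink($node);
        $self->_pushFront($node);
        return;
    }
    my $node = { key => $key, value => $value, prev => undef, next => undef };
    $self->{nodes}{$key} = $node;
    $self->_pushFront($node);
    # Evict the least recently used entry if the cache has grown too big
    if ( keys %{ $self->{nodes} } > $self->{max} ) {
        my $victim = $self->{tail};
        $self->_unlink($victim);
        delete $self->{nodes}{ $victim->{key} };
    }
    return;
}

1;
</verbatim>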
-- RichardDonkin - 16 Jun 2007