Report Plugin Idea
In
FormQueryPlugin and
DBCacheContrib there is an implementation of handling structured data in TWiki.
All my ideas are basically stolen from there.
If you are looking for a working solution, look there.
But please tell me about your requirements (use the comment box at the end of this topic)!
I try to propose a more complete and general way of handling structured data in TWiki.
Unfortunately I'm not a Perl programmer and don't have much time in the near future to implement this.
But I'm starting to share my thoughts.
If I get some good feedback from you, I hope there's something getting out that will be worth learning Perl and spending my time

I expect that some things have to get much simpler first...
And feel free to turn my
Pidgin English
to real phrases.
Where I Don't Want To Go
No TWiki Data in a Database (like MySQL).
I chose TWiki among other things because it stores data in textfiles.
A Database would make TWiki harder to install and maintain and more dependent of certain technologies.
Files are always the easiest way to store Data.
You can use all of your favorite editors and file-managing tools.
No Re-Invention of SQL
On the one hand because it's to complicated and would be never as good as already existing database tools.
But more than this,
SQL just sucks.
It's wannabe-easy but very hard for ordinary people to learn.
And It's much to powerful for this application.
Data Structure
All is based on
tables and
links.
Tables are uniformed data structures.
Links are relations or references allowing one to jump from one table to another.
The diagrams show the structure like it appears for the user.
I does not mean that the data is stored like that in the cache.
More about the cache later.
Basic Structure
The basic structure covers the webs, topics, revisions, attachments and users.
The textlines are just a syntactical trick to query multiple lines in the same topic.
There could be more, like
groups,
offices,
plugins and so on.
I think it's complex enough to start...
There is the option to have the structured data from
all the former revisions.
(Is this of much use? It would be possible to query "who changed when whitch field to what value" ...)
For the moment, the
topic link of the
revisions table points always to the current topic, not the topic of the revision.
Web preferences are part of the dynamic structure, if it is definied in forms.
Dynamic Structure
Forms and embedded tables are dynamically created tables.
Embedded tables must be formatted using the
EditTablePlugin, definied in a separate topic.
This topic is the definition topic of this table and has to be declared in the ReportPlugins konfiguration.
There are predefined links to the topic they are stored in, but nothing more.
The basic strucure could be enhanced by defining user-defined links.
This links belong to a user-defined table (
TWikiForm or
EmbeddedTable) and relate to any other table.
For instance, is there a topicname in a formfield, a link could be definied to this topic.
This is done with
%LINK{...}%.
Meta Structure
Because of its extensibility, metadata should be available in
tables and
fields.
There is currently no idea about a syntax that allows one to access data of a field which name is taken from another field.
Links
While accessing the data one could always follow the links to another table.
Every link is a reference in the table which owns the link.
A link points always to exactly one row in the foreing table.
One could
not follow a link backward.
This is because the backward relation is always on-to-many and thus ambiguous.
There is the possibility with
Nested Reports to list foreign tables that have a "many" relation.
The Cache
There should be a chache like the
DBChacheContrib.
I propose only to put the structured parts (metadata, forms and embedded tables) into the cache.
The full text could be taken directly from the files.
So the fields
text,
summary,
textline.text should not be in the chache.
Functions
There are ordinary functions and aggregat functions.
Ordinary functions
| func |
comment |
format_time(time, format) |
Format a time field to a string. Formatstring like 'yyyy-mm-dd hh:mi:ss'. |
format_num(number, format) |
Format a number to a string. Formatstring like '###\'###\'##0.00' . |
to_time(string, format) |
Evaluates a string and turns it to a internal time value (to compare time fields with constants or user-input) |
to_num(string, format) |
Evaluates a string and turns it to a number. (necessary?) |
calc(...) |
Everything between the brackets is passed to the SpreadSheetPlugin, if installed. Data fields are passed as parameter, there should be a explicit syntax for that. I have to check if there is no syntactical problem. |
| ? |
... |
Aggregat Functions
| func |
comment |
sum(field) |
|
min(field) |
|
max(field) |
|
avg(field) |
average |
| =any(field) |
|
count |
Count rows in the group |
distinct(field) |
Count distinct occurrences of a value. (Or list of values?) |
list(field) |
List all distinct values in the group and separates it with ", " |
Queries
Actually queries are mostly intended to group and filter data several times.
Simple things could directly be made in a report.
I would like to have Queries definied in the plugins topic to be chached, while others are not.
Queries result always in a tabular structure.
There is always
only one source for a query.
Because of the links it is not necessery to have more than one (believe me!)
Data could be
grouped (like
GROUP BY in SQL).
Grouping allows the use of
Aggregate Functions.
Links that are not grouped by are lost.
Grouping generates implicitly a query named
queryname_detail.
The detail query is composed of a link to the original row called
row and a link to the group called
group.
Reports
Reports are used to print the data.
Reports could be nested.
You could define a Query
ad hoc when defining a report.
Syntax Proposition
Bold parameters are
not optional.
%QUERY{...}% Statement
Parameters for
%QUERY%:
| param |
comment |
name |
Name of the query. |
source |
The data source: one table (topicinfo, attachments, revinfo, Formname or EmbeddedTableName) or another query |
first_row |
First record to display |
row_count |
Print at most row_count records. |
filter |
a filterexpression. Format: "field='value'" to filter to a certain value. Instead of = use ~ (regex), >, <, >=, <=, =. |
groupby |
fields to goup by or onerow to group fields to one row. |
calculate |
a list of regular functions to evaluate and turn to regular columns. Format: "expression as columnname, expression as columnname ... " (as columnname is optional) |
aggregate |
a list of aggregate functions to evaluate and turn to regular columns. Format like calculate. |
- Filtering is always done before grouping. If you want to filter the result of the group, you have to define another query.
- calculate is always done before grouping. If you want to calculate with grouped values, use
aggregate instead.
- aggregate is always done while grouping. If you want to group or filter by aggregated values, you also have to define another query.
- Queries are not ordered. Sorting is done in a
%REPORT%
Questions
- I don't know how flexible
filter will be. Is "function(field)>field" possible?
- What about datatypes? What values are compared as string or as number? (Same for sorting)
%REPORT{...}% Statement
There is the
%REPORT{...}% and
%ENDREPORT% statement, between which the fields of the query are available.
An inner nested part refers to the surounding one.
There must be a link between the source of the inner and outer report part.
Parameters for
%REPORT{...}%:
| param |
comment |
name |
Name of the report part to refer to. |
source |
like in %QUERY% |
first_row |
like in %QUERY% |
row_count |
like in %QUERY% |
filter |
like in %QUERY% |
calculate |
like in %QUERY% |
groupby |
like in %QUERY% |
aggregate |
like in %QUERY% |
order |
a list of fields to sort by. Use "-Fieldname" for descending order |
link |
The links to use to join the source of the outer report part, if is should be linked |
%FIELD{...}% Statement
Between
%REPORT{}% and
%ENDREPORT%, fields are accessed simply using
%FIELD{"fieldname"}%.
Parameters for
%FIELD{"fieldname"}%:
A
"fieldname" is composed as
name,
link.name,
link.link.name etc.
reportname:name,
reportname:link.name and so forth.
(
creation.time,
parent.current.user.name,
parent.Product.Producer,
myreport:topic.name).
%LINK{...}% Statement
Parameters for
%LINK{...}%:
| param |
comment |
name |
The name of the link. With this name the target table could be reached from the source table |
table |
the data source: a userdefinied table (Formname or EmbeddedTableName) |
target |
the link target, any table. |
fields |
List of fields in the source table pointing to one of the target records, corresponding to targetkey. |
targetkey |
List of fields in the target table that uniquely define a record in the target table |
Examples
Topics and Revisions
Get a list of topics which contain
SEARCH.
Render it with a list of all contributers.
%REPORT{name="topicpart" source="topics" filter="text~'.*SEARCH.*'" order="creaton.time"}%
---++ %FIELD{"name"}%
| revision | (%FIELD{"current.revision"}% |
| author | (%FIELD{"current.user.name"}% |
| revision time | (%FIELD{"format_time(current.time, 'dd.mm.yyyy hh:mi:ss')"}% |
---+++ Former Revisions and Contributers
%REPORT{name="revpart" source="revisions" link="topic" order="revision" filter="revision < topicpart:current.revision"}%
| *rev* | *user* | *date* |
| %FIELD{"revision"}% | %FIELD{"user.name"}% | %FIELD{"format_time(time, 'dd.mm.yyyy hh:mi:ss')"}% |
%ENDREPORT%
_End of %FIELD{"name"}% revisions._
%ENDREPORT%
Recent Changes in all public webs
Similar to the
WebChanges.
(I'm
not sure that this
%URLPARAM{...} could work...)
%REPORT{source="webs" order="name" filter="public='1'"}%
---+ Recent Changes in %FIELD{"name"}% - Web
%REPORT{source="topics" link="web" order="-time" row_count="%URLPARAM{\"rows\" default=\"50\"}%"}%
---++ %FIELD{"name"}% - %FIELD{"format_time(current.time, 'dd.mm.yyyy hh:mi:ss')"}% - %FIELD{"current.user.name"}
%FIELD{"summary"}%
%ENDREPORT%
%ENDREPORT%
See
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=50][50]],
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=100][100]],
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=200][200]],
[[%SCRIPTURL%/view/%WEB%/%TOPIC%?rows=400][400]]
most recent changes.
Large Attachments
Gets a list of all attchments larger then 1MB.
| *filename* | *size* | *topic* | *uploaded from* | *upload time* |
%REPORT{source="attachments" order="-size" filter="size > '1000'"}%
| %FIELD{"name"}% | %FIELD{"size"}% | %FIELD{"topic.name"}% | %FIELD{"user.name"}% | %FIELD{"format_time(time, 'dd.mm.yyyy hh:mi:ss')"}% |
%ENDREPORT%
Statistics
Produces statistics like the
WebStatistics, without "views", because "views" are not available.
(Could be done if the log were a embedded table.)
Group by month to get monthly revisions.
%QUERY{name="monthly_revs" source="revisions" calculate="format_time(time, 'yyyy-mm) as month" groupby="month" aggregate="count"}%
Group the implicitly generated detail query by user and group (this is the link to monthly_refs).
This results in a row per user and month.
%QUERY{name="monthly_user_contibutes" source="monthly_revs_detail" groupby="group, row.user.name as username" aggregate="count" }%
<TABLE>
<TR>
<TH>Month</TH>
<TH>Topic Changes</TH>
<TH>Top Contributors</TH>
</TR>
%REPORT{source="monthly_revs" order="-month" }%
<TR>
<TD>%FIELD{"month"}%</TD>
<TD>%FIELD{"changes"}%</TD>
<TD>%REPORT{source="monthly_user_contibutes" link="group" orderby="-count" row_count="10" }%
* %FIELD{"count"}% %FIELD{"username"}%
%ENDREPORT% </TD>
</TR>
%ENDREPORT%
</TABLE>
Top Ten Contributers
Counts only the new revisions a contributer has made.
%QUERY{name="contributions" source="revisions" groupby="user" calculate="count, min(time) as min_time, max(time) as max_time" }%
| *name* | *contributions* | *between* |
%REPORT{source="contributions" order="-count" row_count="10"}%
| %FIELD{"user.name"}% | %FIELD{"count"}% | %FIELD{"format_time(min_time, 'mm.yyyy')"}% - %FIELD{"format_time(max_time, 'mm.yyyy'"}% |
%ENDREPORT%
My Contributions
A List of every topic I ever contributed in.
%REPORT{source="revisions" filter="user.name='%USERNAME%'" order="topic.name"}%
* %FILED{topic.name}% rev. %FIELD{"revision"}% - %FIELD{"time"}%
(current rev. %FIELD{"topic.current.revision"}% - %FIELD{"topic.current.time"}%)
%ENDREPORT%
--
StefanSteinegger - 21 Mar 2005
Discussion
Stefan, I am not able to give you a thorough design review due to lack of time but I think what you are trying to do is very worthwhile.
I have one idea to make this a little easier: Why don't you
- generate a set of tables (as you describe above) in the cache and then
- use DatabasePlugin (or similar) to query that cache.
I realize you said you don't like SQL, but this would give you a quicker and standard way to achieve your goal, and it would make it possible for people to use either TWiki or an underlying database interchangeably.
--
ThomasWeigert - 23 Mar 2005
Thanks for your feedback Thomas. It's not the problem, that
I dont like SQL, I know it inside out. But I don't want that TWiki users have to cope with.
Secondly I dont want that the Plugin (and TWiki at all) is too complicated to install, because one have to install a Database to run it cleanly. Then you start to question why there are textfiles to store topics, if you need a database anyway. But this is just a
cache, original data is still stored in the topics.
But it's possibly the quicker way to get any results. I'll think about it. There is certainly a driver for a database in a textfile or something.
--
StefanSteinegger - 23 Mar 2005
It's re-inventing the wheel a bit (
FormQueryPlugin already implements much of the above) though it has some interesting ideas, and I guess I need to comment.
It has been my intention for some time to re-do the
FormQueryPlugin, treating it as a prototype. But my thoughts were taking me in a couple of different directions:
- I want the core to implement a TopicObjectModel - basically do what the DBCacheContrib currently does, but within the core.
- I want to provide a variety of query interfaces to this data - FormQueryPlugin expresses one, this idea expresses another. The one I am most interested in leverages the CPAN SQL parser to implement a subset of SQL.
- Thomas's idea of using a DB as a cache is an interesting one. Any thoughts about performance (populating the cache)?
- To me,
text is just another field in the data model. I want to be able to plug in different "parsers" that extract structured data (such as TWiki tables, headings, section tags etc.) from topic text and incoporate them in the data model.
If there is synergy with your thinking, Stefan and Thomas, I'd be delighted to work with you on this...
--
CrawfordCurrie - 23 Mar 2005
I hoped that you comment my idea, Crawford.
I like your
FormQueryPlugin very much.
As I already mentioned: my
ReportPluginIdea is basically the
FormQueryPlugin, with some differences
- The data structure of the FormQueryPlugin is tree-like. So you have to "flatten down" the data, if you want to query branches like revisions or attachments using
extract. It's the power of relational databases that this is not necessery, because there is no root. You can start querying wherever you need. It should be business of the report to build trees, not of the model.
- I introduced a group by feature, this is very powerful.
- I never liked the
format parameters in %SEARCH{...}% and similar functions very much. It's ok to format a single line. You can write nested %SEARCH{...}%'s, using escape characters. This is actually not readable for human beings after the third level
I'm not sure if FormQueryPlugin doesn't allow nesting %SHOWQUERY{...}% or if it suffers the same syntactical problem. So I think that %REPORT{...}% and %ENDREPORT% could be incredibly cool.
- I use uniformed tables, instead of hashes.
Did you notice how
easy it is to produce this queries and reports?
Of course we need some more "case studies" to decide if it covers all the possible needs.
But I'm confident that many requirements would be ridiculously easy to satisfy.
Interface to other plugins
Your idea of the query interface is interesting.
I didn't care about so far.
The
text field
is part of my model (but not part of the cache, a fact that users never have to care about).
I would only put structured data into the cache.
I propose the interface to plugins like that:
- plugin triggers (like the
commonTagsHandler) could be fired while parsing the data
- plugins use an API to create new tables and put structured data in it.
- %REPORT could be used to render the plugins data.
- The plugin can use the API to access the data in the model and provide TWikiVariables to do anything.
about the Cache
I'm not sure that we need a Database.
I believe that the structured data could be fully in the servers memory.
SQL could be more complicated then just using hashes and arrays.
Then we need a mechanism to find out if the chache is out of date.
- The cache has a last update time
- Every topic that definies a dynamical table (form, edit table definitions) is checked first. If it is newer then the cache, the whole table has to be updated.
- all topics are checked if they are newer then the chache and updated.
- delete records of topics that dont exist anymore
About the Links
I need links between already joined data tables similar to the
FormQueryPlugin.
There is no need to filter a cartesian product to the set of right combinations.
And it could be really fast.
Tell me how to join tables easier!
And its used for the nested reports to link inner and outer parts together.
Linking of user definied data tables using
%LINK{....}% would allow the user to use this feature on user defined tables as well.
--
StefanSteinegger - 23 Mar 2005
Guys, I am extremely delighted that you are looking at this. One shortcoming that TWiki has is the lack of support for generating complex, fast queries against its topics, with the topics interpreted as structured data. And I do agree, the topics should not be viewed as tree structured data, but as a set of relational tables. Or at least, both views should be possible.
FQP is a great start, as Crawford described, but
- I like the thinking that Stefan is describing, and
- There seem to be enough deployment issues with FQP that scares people away.
I still wonder why we could not create an efficient representation of a TWiki web in a cache and then use existing relational query capabilities to do the queries (I am not talking about rolling our own SQL data base here, but to have some minimal support for computing joins).
--
ThomasWeigert - 23 Mar 2005
I still think that serialized arrays and hashes are good enough... Don't forget it's only a cache for faster access. The data is already stored in files... The model is a way to think not a way to store data.
By the way: I suggested
webs to be a table in the Basic Structure. So there should not be a cache per web.
--
StefanSteinegger - 23 Mar 2005
You may get yourself into a performance issue if you have a cache for all of TWiki. In our installation there are many webs with thousands of topics, and
PeterThoeny reports even larger numbers at his installation. Many of these will never be queried by this mechanism.
--
ThomasWeigert - 23 Mar 2005
Thats why I think that only structured data should be in the cache. If you have to read th whole topic text body, you can read it directly from the topic files.
Most topics are composed of the topic name and the revisions (date, revision number and link to the user). If the topic has a form, there is the formname and the formdata (only values, no fieldnames, they are part of the table definition). Then attachments, if there are: name, size, user (link), date, revision, I forgot the comment. Lets say an average of less then 1 kByte per topic. For every thousend topics you'll need less then 1 MB of chache...
It's certainly a good idea if you can put some parameters about data that is used: which webs, which forms, which embedded tables and so on.
--
StefanSteinegger - 23 Mar 2005
I think the key items are:
- All metadata
- All tables in text
I am less worried about attachments, but that would be covered by metadata anyway...
--
ThomasWeigert - 23 Mar 2005