This item is about meta data philosophy and performance.

This is an appeal to return to the essential "wikiness" of TWiki topics.

Metadata was added to the raw text of topics to make certain operations and structures, such as forms, possible. These features effectively try to turn a TWiki web into a database. Unfortunately, they turn it into a very, very bad database. It has long been known that UNIX file structures really aren't the best way to store large volumes of data for fast access.

I'd like to see all the metadata purged from wiki topics and a return to their essential plain-text wikiness.

Of course, that doesn't mean I don't want all the features metadata currently gives me :-). It's tremendously powerful to be able to use a TWiki web as a simple database; I just think it could be done in a much, much better way.

Has anyone considered (e.g.) using a MySQL DB to store metadata? It would be much faster...

-- CrawfordCurrie - 24 Jan 2003

My understanding is that the original design parameters of TWiki were specifically set to avoid using a database. Whether this decision needs to be revised now, or in the future, is an ongoing development topic.

While I admit that a flat file may not be the most efficient place to put a given piece of data, the decision as to where it properly should go is usually either a design tradeoff or a choice of immediate development expediency. Nevertheless, the real problem with METADATA is not the CoreTeam's choice of including it in a flat file, but the lack of a robust API for accessing, modifying, maintaining, and extending the METADATA. A consistent usage of such an API within the core code and plugins would nullify this problem.

If the METADATA API issues were solved (along with a few tweaks of the core code's main data API to better hide the METADATA in its current implementation), the only legitimate complaints of any given API's back-end implementation over another would be centered around how to improve the overall speed of the entire application or the maintainability of the backend API code.

-- TomKagan - 24 Jan 2003

Attempting to use a backend database for key pieces also begins to introduce difficulties in properly backing up the system.

My guess is that ultimately this discussion will be very similar to debates on version control systems, where some tools use common text files for storage (RCS, CVS, Perforce, BitKeeper, etc.) vs. other systems that lock the key pieces in binary files or databases (ClearCase, SourceSafe, etc.).

Personally, I would like TWiki to keep using text files for meta-data, though we might want to consider separating meta-data from the text itself. Something like the Mac resource fork vs. data fork.

That having been said, I would also like to see TWiki move to a cache system for meta-data using something like SQLite. My theory here is that the DB cache could be updated at the same time as the text files, and you could easily recreate/bootstrap the DB from the text files. But the text files would be the ultimate authority, and thus would minimize the backup issues etc.
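
A minimal sketch of that bootstrap step, assuming DBD::SQLite; the cache file location and table layout are invented for illustration:

use DBI;

# Hypothetical cache DB living alongside the data tree; the .txt files stay authoritative.
my $dbh = DBI->connect( "dbi:SQLite:dbname=twiki/data/metacache.db",
                        "", "", { RaiseError => 1 } );
eval { $dbh->do( "CREATE TABLE meta (web TEXT, topic TEXT, type TEXT, args TEXT)" ) };
$dbh->do( "DELETE FROM meta" );    # rebuild the whole cache from scratch

my $ins = $dbh->prepare( "INSERT INTO meta VALUES (?, ?, ?, ?)" );
foreach my $file ( glob "twiki/data/Main/*.txt" ) {
    my ( $topic ) = $file =~ m!([^/]+)\.txt$!;
    open( my $fh, '<', $file ) or next;
    while ( <$fh> ) {
        # every %META:TYPE{args}% line in the topic goes into the cache
        $ins->execute( "Main", $topic, $1, $2 ) if /^%META:(\w+)\{(.*)\}%\s*$/;
    }
    close $fh;
}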

-- JohnCavanaugh - 24 Jan 2003

I think having a non-text method of storing metadata isn't such a big deal as long as the file is still stored within the twiki/data tree. For example, a db3-file or something similar would be acceptable to me. That doesn't really create many problems with backup, since you still can just back up the data and pub trees and be good to go.

Having something like parallel binary and text metadata representations sounds unnecessarily complicated and error-prone to me, at first glance. You may say that the text version is "authoritative", but if you bother to check it on read, then having the binary version would be pointless from a performance perspective. If you don't, then you can get synchronization issues unless you're very careful in the implementation.

So I would go with a single or perhaps per-web unified metadata store file, stored in the data tree with all other topic data. How this file would be formatted would be something I'd have to think on more before I could define what would be optimal.

-- WalterMundt - 25 Jan 2003

Some points:

  • Keep metadata independent, accessible only through an API.
  • Organizations that would like to use a DB for various reasons (performance etc.) should have a set of options for automatically keeping individual elements, such as text, metadata (and the structured data that I am interested in), in the DB.
  • Be aware that WebDAV will give an alternative interface to TWiki. In WebDAV, the property-value pairs are available in XML.
  • While use of the Sleepycat DBM is a good idea, many issues crop up in a multi-process environment (e.g. a process dies while holding the DBM environment open, and other processes can't get a lock).

-- VinodKulkarni - 25 Jan 2003

While I want to be able to use a database for metadata, I would hate to see it become necessary. The same thing applies to revision control: when I first set up the Windows-based TWiki I just turned it off, and had a fully functional wiki without versioning. I am somewhat dismayed at the use of metadata on every page, and the Dev pages in the Plugins web are a good example of a form that does not apply to its topic. It would be worthwhile thinking about using plain text to put metadata into the page -

eg

Parent: MainWeb

is just as effective as %META{"Parent"..... whatever....

and is much more wiki-ish
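
For what it's worth, a loader could accept both spellings with a couple of lines of Perl. A sketch (the %META:TOPICPARENT% form is the current one; the plain-text form is the one proposed above):

# Return the parent topic named on a line, whichever form it uses.
sub parseParent {
    my ( $line ) = @_;
    return $1 if $line =~ /^Parent:\s*(\S+)\s*$/;                    # Parent: MainWeb
    return $1 if $line =~ /^%META:TOPICPARENT\{name="([^"]*)"\}%/;   # current form
    return undef;
}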

Forms and workflow can be similar - I did something like that before the current system existed, but I left Unisys too soon after implementation.

This could fit in with the Settings in tables feature (maybe...)

-- SvenDowideit - 25 Jan 2003

Here's another argument for keeping metadata in the plain-text files. At my workplace, the fact that all business data was stored in plain text was a major argument in favor of TWiki, because decision-makers were reluctant to introduce new databases to the company. It's a major benefit in terms of recoverability that all business data is accessible as plain text in case the server fails for some reason.

Some companies have Linux experts at the ready, but ours does not, so the argument of "just using MySQL" is not as trivial as it sounds to most Linux people. To me personally, it'd be a total showstopper, but then I'm not a TWiki admin either.

I like TWiki because the front-end is so brilliantly simple, and it's reasonably simple on the backend -- basically all you need is a web server and Perl. For every added requirement, the technical barrier gets higher. Not everybody in TWiki's target audience is a skilled Linux operator.

Besides, given that the current twiki.org relies on plain-text metadata and is indeed reasonably fast, I don't see much problem with the current implementation in terms of efficiency. It could probably become 0.2 seconds faster per page view, but I'll probably not notice that.

Storing metadata in plain text also allows me to "hack" the parent property by simply adding that metadata line at the beginning of the edit view. How else would I change the parent of a page that easily?

-- TorbenGB - 25 Jan 2003

There are certainly improvements that can be made to the existing meta data system, e.g. speed and a cleaner API for access. However, I think there are significant advantages to the current approach:

  • Meta data goes with the topic - so copy a topic and you've got its meta data
  • As stated above - keeps TWiki simple, you just need Perl and a Web server

So I think we should keep it, but that doesn't stop TWiki being able to support it being stored elsewhere. I can imagine a future version of TWiki that uses RCS and meta data as now, but that can be configured to instead move all data into a database if required. I've made some moves towards this in the code, but lots more needs to be done.

-- JohnTalintyre - 25 Jan 2003

My first reaction to seeing this topic is "meta data certainly does not suck." It certainly should not be ripped out completely. That seems absurd to me. In fact, one of the reasons I think TWiki is extremely interesting is its use of meta data. From an information standpoint, this has been a missing feature of the web (in general) for many, many years.

Regardless of wikiness, meta data is inherently part of what has become "TWiki-ness". It may not be as easy to use as people would like (APIs, etc). If so, please make constructive and actionable recommendations for either changes to the core code or plugins.

Please also remember the TWikiMission when considering making changes to the required system footprint. TWiki.TWikiSystemRequirements

-- GrantBow - 25 Jan 2003

Well, I didn't say that metadata sucked, though I implied it sucks performance, and there seems to be some agreement with that view.

As I said, I don't want to lose the features of metadata - they are terrific - it's just the storage mechanism. And the principal reason I don't like that is that I'm trying to use a wiki for something it wasn't designed for: as a database.

Reflecting on the discussion above, what I really want is the recognition that data and meta-data are different. If the Store module knew about metadata, I could always plug in an alternative implementation that uses a MySQL DB or whatever, without breaking the default "embedded" implementation. The current API is struggling in that direction already.

-- CrawfordCurrie - 27 Jan 2003

I just remembered one reason why I like metadata where it is, and haven't figured out a good way to do it in a DB... versioning!!! The current flat-file metadata format is a versioned database that copes reasonably with schema changes. If we provide a script that goes through the pages fixing them up when the template changes, we then have a versioned database with on-the-fly schema transformation.

-- SvenDowideit - 27 Jan 2003

( moved comment from AntonAylward to PerlDBI topic and my last post. -- GrantBow - 03 Feb 2003 )

I apologize for jumping to an incorrect conclusion about the purpose of this topic.

I agree that data and meta data are distinct. There need to be better ways to handle the two types of data. I feel that NOT versioning meta data would be tricky and remove the simplicity from the current KISS implementation. There may be very good reasons to do this, but they need to be clearly outlined. The UI design might be especially tricky.

A new meta data storage implementation could be wonderful depending on how you implement it. I am confident that good ways can be found to address whatever issues arise.

I recommend that the first step in implementing a better meta data storage would ideally be to add hooks for expansion. Perhaps MetaDataHooks or maybe MetaDataAPI might be good new topics to begin to work out the implementation requirements. This would allow new meta data storage implementations while at the same time allowing the possibility of leaving the default method in place. Updating the API seems the best way to go.

Crawford and others, what are the most relevant and important topics that already exist that relate to metadata API changes and/or performance?

-- GrantBow - 03 Feb 2003

I also apologise. I was taking a more abstract view of metadata ....

There's data, metadata, yes, and ... well ... metadata-of-another-color. Perhaps this is why my PerlDBI comment was not appropriate here. I was referring to that "wasabi-metadata" (wasabi being horseradish of a different color).

Let's look at the categories here.

DATA: What the user sees when browsing the wiki.
METADATA: Keeping track of how the data got to be what it is and all the data that isn't simple, linear text, such as attachments and forms.

I'm tempted to call that structural metadata. Comments?

Wasabi-Metadata: Things the user doesn't see that affect the operation of TWiki. The one I'm primarily concerned with is access control, which certainly should be no more visible to the user when editing text than the other metadata, but whose functionality is completely separate.

While on this issue of orthogonality, I would venture that revision-control metadata is orthogonal to other-than-linear-text metadata. In one sense we have that right at the moment, with one mechanism for forms and attachments and another (RCS) for revision control.

Anyway, I'll take my wasabi-metadata ball and play elsewhere with it ;-)

-- AntonAylward - 03 Feb 2003

Out-of-band leakages are always fun: ever noticed that TWiki keeps its meta data right next to the text?:

$ more Cleaver/WebHome.txt
%META:TOPICINFO{author="MartinCleaver" date="1048908060" format="1.0" version="1.12"}%
Welcome to Main.MartinCleaver's Wiki!

With no separator between the %META: line and the topic I venture that I could stuff something into the meta just by putting it at the start of the topic. Am I right?

-- MartinCleaver - 14 Apr 2003

Martin, AFAIK you are very much right! In my post above I'm making a case for why this is so very useful. In your example just now you used the metadata TOPICINFO, but I often use this hack/shortcut/? to update the PARENT metadata. I don't know any other way to update the parent reference, but the ability to "hack" it like this is extremely useful, and I would be sad indeed if this ability were removed.

Well, IRT setting the parent, there is an option for that under the "More..." link of each topic. -- WalterMundt - 16 Apr 2003
Wow, I hadn't found that feature. But it only works within the same web. Also, it's a bit cumbersome, to go to More, to set the parent, to edit the page (and not change it?), and to save it. If I need to edit the page anyway, why not simply "hack" it - is there a drawback? -- TorbenGB - 16 Apr 2003

-- TorbenGB - 15 Apr 2003

Just my grain of salt: In my view, metadata are different from text, because:

  • they must be handled by the machine, not humans
    • modifying them should be quick, with no locking problems: basically one could have a Perl API to "set METADATA_X to VALUE" in one (nearly) atomic operation. This cannot be done if they are stored in the text (users can keep the topic locked for an hour)
  • they should be queried (somewhat) efficiently
But, I chose TWiki because it didn't use a database, for robustness reasons.

So, maybe a solution would be to store metadata in separate .meta files beside the topic. Metadata would then be stored one per line, as NAME VALUE pairs.

My feeling is that handling a lot of these small files is something that current machines can do fast enough to be practical. And we will save the CPU overhead of scanning big .txt files for metadata strings.

Plugins or user code could also add their custom metadata easily with this scheme.
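
A minimal sketch of the (nearly) atomic "set" under this scheme - write a temp file, then rename it over the original. The file layout is as proposed above; the function name is invented:

# Set NAME to VALUE in a topic's .meta file without locking the topic itself.
sub setMeta {
    my ( $metaFile, $name, $value ) = @_;
    my %pairs;
    if ( open( my $in, '<', $metaFile ) ) {
        while ( <$in> ) {
            $pairs{$1} = $2 if /^(\S+)\s+(.*)$/;   # one NAME VALUE pair per line
        }
        close $in;
    }
    $pairs{$name} = $value;
    open( my $out, '>', "$metaFile.tmp" ) or die "can't write $metaFile.tmp: $!";
    print $out "$_ $pairs{$_}\n" foreach sort keys %pairs;
    close $out;
    rename( "$metaFile.tmp", $metaFile );   # rename() is atomic on POSIX filesystems
}

# e.g. setMeta( 'twiki/data/Main/SomeTopic.meta', 'TOPICPARENT', 'WebHome' );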

-- ColasNahaboo - 15 Apr 2003

I'm another user who chose TWiki specifically because it doesn't require a database. I may want to change hosting providers in the future and I don't want to have to deal with database issues. Storing things in plain files is much better for me.

-- ChrisRiley - 15 Apr 2003

The idea of a .meta file per topic is beautiful! Mind you, I'm not a TWiki developer (yet), but the conceptual side of it is nice and simple! We already have a .txt file for the topic and a .txt,v file for the revision control; adding a .meta file would fit nicely and be logical enough to work with even on the server-side file structure. But it is a major TWiki change... (Me three, I prefer text files over a database for reasons of manageability.)

-- TorbenGB - 16 Apr 2003

I'm unclear why putting meta data in a separate file would be an advantage. At present the meta is under version control with the topic and scripts like rename can work on it in the topic text just as they can for other topic text.

In the TWiki code the meta data is separated out at load time and held in data structures. This is why it is not available to the plugins. It has been suggested that it be passed around just like the rest of the topic. So we appear to have two opposite suggestions. I kind of like it being a bit special as it is at present, but I can't see the point of a separate file. I can see the point of allowing the current mechanism or a database. The database would allow faster searching and could ensure more consistency.

-- JohnTalintyre - 16 Apr 2003

There are pluses and minuses for a separate .meta file:

  • :-( Scaling issues due to file system limitations. At work we have several webs with many thousands of topics. There are issues if you have more than, say, 20K files in a directory (sluggish performance; failed grep due to shell limitations)
  • :-( / :-) Handling two sets of files - .txt, .txt,v, .meta, .meta,v - raises questions of complexity and out-of-sync issues
  • :-) Possible to update .meta programmatically without lock issues

I agree with John, stick with the current flat file system (simple) for now, and offer the choice of a flat file system or a database backend in the future.

-- PeterThoeny - 17 Apr 2003

At the moment the Store implementation is rather widely distributed through the code, which makes it difficult to come up with a different Store implementation. There have already been moves to decouple Store - for example, loadTopic which returns the text and the meta separately.

But the implementation is still far too deeply threaded. Many plugins rely on being able to manipulate text files. Even some of the core code (Search.pm, for example) assumes the Store implementation.

Shame really.

The long-term solution is to rearchitect the code; preferably using an OO approach that would allow me to say things like:

my $twiki = TWiki->new( $query );
my $web   = $twiki->loadWeb( $webname );
my $topic = $web->getTopic( $topicname );
my $meta  = $topic->getMetaData();
my $form  = $meta->getForm( "FormName" );
my $field = $form->getField( "FieldName" );

See also TWikiOO, a thread that sadly seems to have died.

(Note: I'm coming at this from the perspective of a plugin developer. I want to be able to talk to all these objects through their published interfaces, not just the pick-and-mix subset currently exposed in Func::)

-- CrawfordCurrie - 17 Apr 2003

Time, perhaps, for me to stick my oar in again, and, perhaps, be booted out, again.

My original problem, back when I was discussing access control, was that I recognized, but failed to emphasise, that there were different types of metadata.


Synopsis:

  • Some of the things labelled METADATA affect what the user sees and can manipulate
  • Some of the things labelled METADATA are only of concern to the system
  • Some of the things that affect what the user sees and can manipulate are not labelled METADATA

So, redivivi, let me suggest a "divide and conquer" approach.

There are things that we can call structural metadata and things we can call operational metadata.
One might argue about what goes in which category, but consider:

  1. Historic Structure
    • The (invisible) header on each topic that is the shorthand for the revision and last author.
      In absolute terms, this is not necessary, it could be extracted from the RCS at the cost of more computation. That this information is displayed in the frame around the topic "contents" is a nicety; many people don't pay attention to it.
    • The RCS.
      One might assert that revision tracking and rollback is not a core function.

  2. Then we have the "_what gets displayed_" Structure.
    All this information is contained in the topic itself. The METADATA tag is there to tell the display engine to process this in a special manner, as evidenced by the plug-in mechanisms.
    • Tables
    • Graphics
    • Attachments
    • Other Plug-in metadata

  3. Finally we have "operational" metadata.
    This affects how the topic works, how the page interacts with the user and the system and perhaps the browser.
    • Access Control
      Just because there isn't a line saying "META" for access control doesn't mean it's not meta-data. Access control should not be a part of the - potentially - displayable and user-editable content of the topic.
      Think about a normal file system. Think also of the old VAX VMS file system, which maintained its own revision history, but whose access control was still "OOB".
      I also note that VMS did not keep a revision history of the changes to the access control. Ask yourself if a revision history of the access control information is needed, and how that decision impacts the design.


This is a brief "chop-job", roughly delineating boundaries, but I think that if we identify what is about "what the user HAS to see" and "what the system has to see" - sort of like rendering unto Caesar and unto God (or, as Norbert Wiener said in "God and Golem, Inc.", Man and the Computer) - we can limit the confusion and make progress.

  • Some metadata has to be accessed every time a topic (i.e. page) is accessed.
  • Some metadata is processed only when a topic is displayed.
  • Some metadata is affected only when a topic is updated.
  • Some metadata is affected only when changes are made to the structure of the web.
  • Some metadata is altered every time the contents of a topic are altered.
  • Some metadata is not going to be altered when the contents of a topic are altered.

Yes, I know that's stating the obvious, but I think part of our problem is that we've called - and hence treated - it all the same.

"Metadata sucks but some metadata sucks more than others".

-- AntonAylward - 18 Apr 2003

I understand what you are saying this time, Anton, and I think you make some good points. You've brought up an issue we've (somewhere on TWiki) mentioned before: that access control stuff belongs in the metadata, not in the topic.

In the limiting case, the .txt file should contain ONLY what is to be displayed.
In a perfect world it would be in a form that TWiki just put the wrapper around - the headers, menus, copyright, and the HTML HEAD and BODY statements - and tossed it out. In a simple (e.g. first-generation "Cunningham") wiki, markup is just transformed. No "metadata".

The plug-ins and things have the metadata that says "invoke this special display processing" of the displayed page.

Am I making it clear here that the line saying:

Topic MetadataSucks . { Edit | Attach | Ref-By | Printable | Diffs | r1.32 | > | r1.31 | > | r1.30 | More }

is not part of the contents of the page? It's the "framing" - for want of a better term. It's a menu and refers to the RCS - the 'structural metadata', not the 'display metadata'. It's processed differently.

-- AntonAylward - 18 Apr 2003

I have previously proposed that the meta data contain a section reserved for plugins. Does this agree with your idea of 'Other plugin metadata'?

WRT the other points you make I am not sure what action you think we should take beyond the naming convention. Not to knock clarifying, that in itself is a worthy goal, but does it help beyond that?

I believe so. As I said, "divide and conquer". Partition the problem into manageable parts and solve each one separately. "Don't try eating the pig in one mouthful". The example I gave about VMS can be used to illustrate this. If you were running on VMS, the system would already be 'doing' part of the solution for you, so you'd only have to concern yourself with the "display" metadata.

-- AntonAylward - 18 Apr 2003

BTW: redivivi is a great-sounding (and apparently Italian) word, but what does it mean?!

Actually it's Latin - "brought to life again" or "revisited". I 'stole' it from the album title "Amoendi (?sp?) Redividi", which, while very nice, dates me, doesn't it? -- AntonAylward - 19 Apr 2003

-- MartinCleaver - 18 Apr 2003

It's been a while since I've posted anything here. 8)

In response to comments by CrawfordCurrie: yes, the code base is riddled with assumptions at various levels. Fixing that, at least when I was looking at it, was/is not easy. One of the things holding it back is the dependency on RCS.

I recognise that. That is why I'm emphasising this division.
We can solve the non-RCS part of the problem if we "divide and conquer". As I said, "pretend you're on VMS and the OS is doing the revision history ..."

-- AntonAylward - 19 Apr 2003

I've thought for a long time that it would be better to completely drop RCS and replace it with a simple perl structure that got serialised to disk. Serialise it with something like YAML (http://search.cpan.org/author/INGY/YAML/YAML.pod), use perl Diff/Patch to generate a version tree. Would make it easy for plugins to add additional Metadata 'resource forks' directly to a TWiki node.
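
A sketch of that serialisation with the CPAN YAML module; the node structure here is invented for illustration:

use YAML qw( DumpFile LoadFile );

# An invented TWiki node: topic text plus arbitrary metadata 'resource forks'.
my $node = {
    text => "Welcome to Main.MartinCleaver's Wiki!\n",
    meta => {
        TOPICINFO   => { author => "MartinCleaver", version => "1.12" },
        TOPICPARENT => { name   => "WebHome" },
    },
};

DumpFile( "WebHome.yaml", $node );         # serialise to disk
my $again = LoadFile( "WebHome.yaml" );    # ... and round-trip it back
print $again->{meta}{TOPICINFO}{author};   # prints "MartinCleaver"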

Possibly. Or just abstract the 'history' mechanism.

One thing we must not do is put the history in with the 'display'. The RCS file means that the "display mostly" mode of TWiki is done efficiently.

-- AntonAylward - 19 Apr 2003

The best way forward probably would be to provide a YAML object handler that lazily cached the data structure in memory, so that there would be one disk read per HTTP request.

However, a full re-architecture would still be required to provide the level of access to the dataflow you wish.

-- NicholasLee - 19 Apr 2003

I've read this topic very carefully, but I cannot imagine why only the metadata should be separated out into a database.

I'm a TWiki admin in our company. We are trying to use TWiki in place of all the company's information systems, but it seems to be impossible due to the main TWiki limitation - TWiki doesn't scale.

All TWiki content has to be moved into a scalable environment. I suggest that MySQL is an acceptable solution. I've made a number of changes to the TWiki source code, and it seems that moving from the filesystem to a database is possible. All plugins use the core TWiki or Storage APIs.

I prefer pluggable storage for TWiki. The default would be the filesystem, and somebody who wants to use TWiki in a production business environment must be able to switch the storage.

-- OtoBuchta - 22 Apr 2003

PS: I was able to save the topic only after three unsuccessful attempts (operation timeout) :-(. Also, http://TWiki.org/ doesn't scale :-(

Oto: Could you explain what exactly you mean by "doesn't scale"? I can see a number of possible interpretations.

The situation with TWiki.org is that it is hosted on a machine shared with other resources that are under heavy demand. It would be better to simply say that the host is overloaded. That is not a failing of TWiki but of the hosting organization. It doesn't point a finger at TWiki.

There are two distinct modes to TWiki, as there are to most things on a computer, in that it is read-mostly. I'd estimate I read over 20 topics for every one I contribute to. When we think of the alternative "file system" vs "database", the argument gets a bit silly. A file system IS a database, as Rob Pike has pointed out many times. The issue is one of indexing and extraction.

Some file systems are more efficient than others, and some databases are more efficient than others. A database built on a file system is going to be less efficient than a database built on a raw partition, but the database is then replicating the layers of index, allocation and retrieval that are built into the operating system and are already "in core" (or mapped as VM).

One can't blindly assert that a database is "more efficient" than a file system without a lot of qualification and instrumentation.

One also can't assert that TWiki is not scalable - again, without qualification.

  • Are you running any other applications on the machine?
  • How many users are using TWiki?
  • How much tuning have you done of the web server?

Many studies have shown that the connectionless protocol of a web server can be a more efficient use of resources, especially in an N-tier setting, than connection-oriented protocols.

I've tried to emphasise in my earlier posting that there is metadata that is used every time a topic is displayed and that affects what appears in "the frame". If it wasn't for 'shorthand' and skins, this could all be stored as pure HTML. The consequence of this is that how it is stored is irrelevant. I could snapshot a particular web with a web-walker such as wget or cURL and view it all off-line.

In an extreme case we could throw away the revision history altogether if we decided never to backtrack.

So we have three distinct types of 'metadata':

  1. The Topic History
  2. Information about how the Topic is to be displayed
  3. Access Control and other "i-node" information.

I can see that some information about topics in a web would be better contained in a single web-level repository. I've discussed this wrt access control elsewhere.

However, try as I might, I fail to see how a relational database - and the overhead that goes with it - offers any advantage over a "flat file", basic indexed database such as a web-per-directory.

Some filesystems, such as ReiserFS under Linux, are quite efficient at indexing. Then there are issues about OS-level i-node and directory caching, file name caching within Apache, and so forth. Other topics in this web discuss things like mod_perl. What file system are you using? Have you tune-a-fished it? Have you instrumented your file system activity, i-node and directory cache hit ratios? Have you instrumented Apache? Have you related this to the end-user usage profiles?

I've been using UNIX variants for over 25 years now, and a lot of my client work has been tuning poorly performing systems. I've achieved improvements of as much as 2,000% (yes!) just by repartitioning and configuration changes. Few systems I encounter don't need aggressive tuning and configuration work.

Unless and until you have exhausted all those, and unless and until you can define what you mean when you say "doesn't scale", you are flying in the face of many people who have built very scalable Perl CGI applications on Apache. While TWiki has many shortcomings, it delivers a lot with very little because, in the best UNIX tradition, it builds upon existing tools and facilities in a minimal and parsimonious manner. I'm reluctant to move to a complicated component such as a relational database for storing what can be adequately handled in a file system.

-- AntonAylward - 23 Apr 2003

All I want is some way to achieve a cleaner separation between metadata and text, so that I can, if I want, choose to plug in an alternative store implementation. Now, I think the interface to that implementation will look a lot like an interface to an RDBMS, but that doesn't imply the underlying implementation.

Following AntonAylward's rules for eating pigs, we could start by finishing the job of cleaning up the Store API so it handles topics as duples of metadata and text. Then we could delegate the search functionality to the Store module. Step-by-step we can move the responsibilities where they belong, leaving a cleaner TWiki behind.
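
To make that concrete, here's a sketch of what the duple-handling Store interface might look like. The names and the backend delegate are hypothetical, not the current API:

package TWiki::Store;   # sketch, not the shipping module

# Every topic is a duple: a metadata hashref and a text string.
# Only the backend (flat files, RDBMS, ...) knows how they are stored.
sub readTopic {
    my ( $self, $web, $topic ) = @_;
    my ( $meta, $text ) = $self->{backend}->fetch( $web, $topic );
    return ( $meta, $text );
}

sub saveTopic {
    my ( $self, $web, $topic, $meta, $text ) = @_;
    $self->{backend}->store( $web, $topic, $meta, $text );
}

# Searching delegated to the Store, so a database backend can use an
# index instead of grepping every .txt file.
sub searchMeta {
    my ( $self, $web, $type, $field, $value ) = @_;
    return $self->{backend}->search( $web, $type, $field, $value );
}

1;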

-- CrawfordCurrie - 24 Apr 2003

LOL! Well, assuming I'm not a vegetarian ... at least eating pigs avoids Mad Cow disease.

The impediment now is that we are trying to be a Python - but this is Perl. The Python eats the pig in one mouthful. I'm saying "Don't Do That!". It's not so much step-by-step as that, if we recognise the different types of meta-data, we can apply different solutions to the different parts. It makes it a number of small jobs. Like the old story about the Two Watchmakers.

If we look at the old-old VAX file system, it kept access control "out of band" - in i-node thingies. It kept revision history in "step and repeat" copies of the file, and flags on the "dir" command would show either the current file or the historical sequence. The "oob" info said what type of file it was: text, fixed format, etc. This information is ORTHOGONAL.

The problem with TWiki's metadata is:

  1. Not all of it is tagged METADATA
  2. Some of it is to do with what gets displayed in the 'frame'
  3. Some of it has to do with how the revision history is done.

Now I don't give a pig's posterior whether the revision history is RCS, CVS, an indexed database, or even something so pointless as to have all the overhead of a relational database. At one level I don't care if there is an implementation that throws away all previous revisions.

I do care if how it's done appears in the body of the topic. It doesn't belong there.

I don't give the pig's posterior if the access control information is stored on a site, web or topic basis, or if it's in a text file, a DB file or whatever.

I do care if how it's done appears in the body of the topic. It doesn't belong there.


From an implementor's POV, putting revision and access control in the topic is an example of a highly coupled system, which has for at least the three decades of my experience in programming been recognised as a BAD THING. You can check the early papers on structured programming, such as Myers's "Reliable Software Through Composite Design" from 1975. Later advances - OO, UML and so forth - have not invalidated that. What we have at present, by putting metadata in the topic ".txt" file, is a classic example of "content coupling". In code, this is like jumping from one module into the middle of another module. It is highly pathological and makes maintenance much harder.

Because we have 'stuff' that has nothing to do with display embedded in the topic pages, the developers are not free to redesign the parts of TWiki that deal with revisions and with access control. When a developer builds the access control module he should, by basic principles of good design that date way back, supply an interface that answers the questions:

  • Can I view this topic?
  • Can I edit this topic?
  • Can I move this topic?
  • Can I create a topic in this web?

We can see from the equivalent questions about files in a file system that the answers are not dependent on the content of the file, and can be inspected and altered without altering the contents of the file.
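
In code, the whole published surface of such a module could be this small. A sketch with invented names; how _allowed() stores its answers - "Set" statements, a DB file, whatever - stays private to it:

package TWiki::AccessControl;   # hypothetical module

# The rest of TWiki only ever asks questions; it never sees the representation.
sub canView   { my ( $user, $web, $topic ) = @_; return _allowed( 'VIEW',   $user, $web, $topic ); }
sub canEdit   { my ( $user, $web, $topic ) = @_; return _allowed( 'CHANGE', $user, $web, $topic ); }
sub canMove   { my ( $user, $web, $topic ) = @_; return _allowed( 'RENAME', $user, $web, $topic ); }
sub canCreate { my ( $user, $web )         = @_; return _allowed( 'CREATE', $user, $web, undef  ); }

sub _allowed {
    my ( $mode, $user, $web, $topic ) = @_;
    # backend-specific lookup goes here; swap it without touching any caller
    return 1;
}

1;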

Trying to deal with a highly coupled system is a pig - one I don't care to try to eat all in one go.

-- AntonAylward - 24 Apr 2003

There are more conversations about this at MetaDataDefinition#Questions, BTW.

Aside: I was reminded of that topic when I started thinking that, to improve the semantic capabilities of TWiki, I'd like to have a TopicAgreesWithMetaDataField field in each topic so that both the reader and TWiki itself can follow (a/multiple?) lines of inference. Of course, this functionality would have to be in a plugin, but the information required is per-topic and meaningless without that topic. It therefore belongs with it. The meta data is the best place for it, as I want to programmatically manipulate that field and wouldn't want the ordinary user messing with it.

-- MartinCleaver - 27 Apr 2003

An interesting observation but not altogether valid. Yes, the metadata is meaningless without the topic. Essentially that means it can be ignored.

It is here that the analogy with a file system breaks down. The metadata of a file - the inode - is the precursor to the file. TWiki doesn't work like that. A simple bit of experimentation will show you that the .txt file is the precursor.

  • You can have a .txt file that is the target of a WikiWord in another topic that exists just as a .txt file but which does not have a .txt,v file
  • You can have a .txt file that is a valid topic but which does not have any access control information in it.
  • You can have a .txt file that is a valid topic but which does not have any METADATA lines in it.

(If you don't believe me, perform this experimentation for yourself. The TWiki code is very tolerant.)

Similarly, if the access control for a topic Main.SomeSuchSubject were in an external site- or web-wide database and the topic file was absent, it wouldn't matter. TWiki looks for the topic file before going down the path of TWiki::Access::checkAccessPerm().

You might also inspect the flow of the code for a "view" operation, for example. TWiki looks for the .txt file, not for the .txt,v file.

Finally, if we do consider the file system analogy, we also have to consider that no real production-quality file system operates without error handling and such tools as FSCK. Integrity constraints are fine, but there are various ways of implementing them, and putting everything in one file is a pretty primitive way. I'm reminded of the changes made to the UNIX file system in the transition from V6 to V7. Recognising that metadata and data should be treated differently when deleting a file improved file system reliability enormously and made the logic that underlay FSCK possible.

I do note, however, Martin, that much of your comment is only valid if one globs together all of metadata as one conceptual unit. The point I'm trying to repeatedly make is that there are distinct and orthogonal (i.e. non-interfering) forms of metadata.

-- AntonAylward - 27 Apr 2003

I've just had occasion to do some debugging of a piece of code that isn't relevant to the group at large, but in doing so made an observation.

The current method of dealing with preferences involves slurping up the various preferences files and stepping through them line by line. No, not reading them in line by line.

Compared to this, keeping preferences in a DBM hash file would be an enormous boost to an execution path that is critical to all operations within TWiki.
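
For illustration, a sketch with DB_File from the standard Perl distribution; the file name and keys are invented:

use DB_File;
use Fcntl qw( O_RDWR O_CREAT );

# Tie the preferences to a DBM hash file once, instead of slurping and
# re-parsing the preferences topics on every request.
tie( my %prefs, 'DB_File', 'twiki/data/Main/prefs.db', O_RDWR|O_CREAT, 0644 )
    or die "can't tie preferences: $!";

my $skin = $prefs{SKIN} || 'default';   # one keyed lookup, no line-by-line scan
$prefs{WEBBGCOLOR} = '#FFD8AA';         # writes persist immediately
untie %prefs;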

-- AntonAylward - 01 May 2003

Let's check, as I think we agree on this.

  1. MetaData is dependent on the topic, without the topic the meta data is meaningless.
  2. There needs to be multiple types of meta data.
    1. There is meta data for the revision system
    2. There is meta data for the permissions system (but it is (IOO badly) currently embedded in the page)
    3. I said I wanted space reserved in the meta data for plugins
    4. (added after Anton agreed) There is meta data for category tables
    5. The page hits counter is also metadata
    6. The revision number of a page that a particular user has seen is also meta data

Are you saying that it is not possible to have meta data reserved for plugins, as this would require it to be stored in the same place? I don't think you are (this would violate the separation of the logical layer from the physical layer).

-- MartinCleaver - 01 May 2003

Yes, to all. Nice observation "logical" vs "physical".

I also admit that we are not going to make all these changes in one go.

However recognising that they are different problems will let us solve the problem in parts.

"Simon's Watchmaker" principle applies here. http://commons.somewhere.com/rre/2001/RRE.Hierarchy.and.Histor.html
http://www.google.com/url?sa=U&start=23&q=http://winf.at/~klaus/ahl-hierarchy-theory.pdf&e=6251
http://www.cs.ukc.ac.uk/pubs/1998/714/
(Of course it helps if you have a copy of Herbert Simon's classic and seminal paper to begin with)

Synopsis of Herbert Simon's "Watchmaker" principle

"Suppose each watch consists of 1000 pieces. The first watchmaker constructs the watch as one operation assembling a thousand parts in a thousand steps. The second watchmaker builds intermediate parts, first 100 modules of 10 parts each, then 10 subassemblies of 10 modules each, then a finished watch out of the subassemblies, a somewhat longer process, 110 steps longer.

It would seem that constructing a watch in a single sequential process would progress faster and produce more watches. Alas, life being what it is, we can expect some interruptions. Stopping to deal with some environmental disturbance, like a customer, the watchmaker puts down the pieces of an unfinished assembly.

Each time the first watchmaker puts down the single assembly of 1000, it falls apart and must be started anew, losing up to 999 steps. Interrupting the second watchmaker working on a module of 10 using hierarchical (in the first sense) construction means a loss of at most 9 steps.

For organizing complexity, the moral is this: taking a few extra steps in the short run saves many steps in the long run.

In anything less than an environment of no change, the second watchmaker will be much more successful in finishing the complex whole. Using an elegant mathematical demonstration, Simon shows how dramatically more successful the modular-levels principle is in producing stable and flexible complexity. Nature, he says, must use this principle. And, indeed, systems scientists have extensively documented this level pattern of organization, whether physical (such as particle, atom, and molecule), biological (like the example of cell, organ, and body), social (for example, local, regional, and national government), or technological (one example is phones, local exchanges, and long-distance networks)."

-- AntonAylward - 02 May 2003

The second watchmaker's approach also allows for another key aspect of the assembly of complex systems: modular testing. Even more radical: write the tests first that express what we want to do with metadata.

BTW, I just had yet another user complaining about the performance of one of our TWikis. "Why did you upgrade it?" (to Feb 2003) he asks. "It did almost as much before and was a heck of a lot quicker."

-- CrawfordCurrie - 02 May 2003

Re the performance, have a look at SiteMapIsSlow and see what you think. Apart from that, I don't think the performance is that different, but I haven't done any real side by side tests.

-- RichardDonkin - 02 May 2003

Re: Watchmakers. I was brought up to view things like that, and have been amazed at a later generation of programmers that 'design and implement' on the fly with little forethought for the systemic architecture and none for the testing. My early mentors had a "Code! Why, that's the last thing you do" approach.

Sadly, I've seen many big projects (>100 people, >5 locations, >20 IBM SP frames, >5-year project lifetime) which lacked modular testing, or even a coherent test plan, or such basics as 'structured walkthroughs'.

Did I mention instrumentation? Unless you make meaningful measurements - and looking at your watch is not meaningful enough - we may as well be throwing dice.

See MetaDataSucksMeasurement - I played around with some profiling and instrumentation tools, and the results were interesting.

I'm not prejudiced against relational databases; it's just that this isn't a relational database problem.

However, by the modularization principle discussed above, I'm proposing addressing ACCESS CONTROL. Yes, all variables can be migrated to a DBM hash, which would allow for some regular GUI stuff and probably simplify a lot of other code as well. But access control is a distinct case. The variables used in, for example, skins (see TWikiSkins) are not the access control ones.

We should only need good design principles to justify putting access control out of band. It isn't going to degrade performance, and it opens up avenues of simplified and more modular approaches to the GUI and to the code hierarchy.

-- AntonAylward - 02 May 2003

The benefits of all this I've already been over.

  • Modularisation, by separating the access control representation from the topic (making it out-of-band data) and its handling (by making it a DB and not "Set" statements).
  • Orthogonality, since the access control code does that and only that, and can be tested and replaced independently of other settings.
  • Performance. While the individual speed gain may not be noticeable, the reduced processing will reduce the overall loading on the machine.
  • Scalability. Since the present (01Feb2003) implementation has to step through a number of topics line by line, DB-based access control will be independent of the size of the topics or the number of "Set" statements that have to be parsed.

-- AntonAylward - 02 May 2003 (and refactored 4 May as Martin suggested)

Added a couple more metadata types.

-- MartinCleaver - 06 May 2003

Independent of that, I think you misunderstand my point above; let me ask you the following... What is better, the concept of the registry database in MS Windows, or the concept of resource forks in MacOS? Anybody who has had to upgrade the OS on both of these machines will be quite clear on the answer, I would venture...

The point is, things are not as absolute and clearcut as you make them seem to be.

-- ThomasWeigert - 05 May 2003

I copied the last comment from MetaDataSucksMeasurement so I could ask this follow-up question:

Which is better - the registry or MacOS resource forks? (I never fooled with Macs at that level.)

-- RandyKramer - 06 May 2003

Just to add fuel to the fire, here are a couple of things I was asked for today by some sophisticated users who are using forms to implement an issue tracking database:

  1. Searches constrained to a single meta-data field (can do with %SEARCH% but....)
  2. Inclusion of a single meta-data field value in another topic (sorta %METAFIELD{Topic,FieldName}%) - see the sketch below
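
The second of these could be done as a small plugin. A sketch using the plugin API's commonTagsHandler; METAFIELD is an invented tag, the plugin itself is hypothetical, and it assumes TWiki::Func::readTopicText behaves as in recent plugin API versions:

package TWiki::Plugins::MetaFieldPlugin;   # hypothetical plugin

use TWiki::Func;

# Expand %METAFIELD{"SomeTopic" field="Priority"}% to that topic's form field value.
sub commonTagsHandler {
    # ( $text, $topic, $web ) - $_[0] is modified in place
    $_[0] =~ s/%METAFIELD\{"(.*?)" field="(.*?)"\}%/_field( $_[2], $1, $2 )/ge;
}

sub _field {
    my ( $web, $topic, $name ) = @_;
    my $raw = TWiki::Func::readTopicText( $web, $topic );
    # simplified parse of the %META:FIELD{...}% line for this field
    return $1 if $raw =~ /^%META:FIELD\{name="\Q$name\E".*?value="([^"]*)"/m;
    return '';
}

1;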

-- CrawfordCurrie - 07 May 2003


Related topics: DatabaseForPreferences PrefsDotPm AddValuesToPreferenceVariable MetadataSucks MetaDataSucksMeasurement StepByStepRenderingOrder DatabaseForAccessControl SimplifyInternalMetaDataHandling MetaDataDefinition DatabaseForMETA MegaTWiki
