
Storage of Meta Data

Synopsis: On HandlingOfMetaDataInTopicText a subtopic emerged dealing with the question of whether meta data (not "real" meta data, but what TWikiMetaData describes) should be stored as part of the topic, or somewhere separate.

I have factored this discussion out of the original topic as it is somewhat unrelated to the requirements discussion of what should be done when a user enters meta data in text. There are other topics where this has been discussed, which I will try to find as well...

-- ThomasWeigert - 10 Feb 2004

Issue: How should meta data be stored?

  • SD: there is no METADATA, it's all data, with some values named and thus accessible the way METADATA is currently

Proposed Spec 1.1: (ColasNahaboo) Completely separate metadata from the text, and make meta data "open" (i.e. anybody could add their own metadata).

  • A topic would be composed of a metadata block and a text block (or "fork" in Macintosh lingo, or "part" for MIME documents)
  • The metadata block would consist of N metadata entries, each consisting of:
    • A name (any string? an alphanumeric identifier? a WikiWord? ...)
    • A type (as for TWikiForms or EditTablePlugin: text, select, date...). Type could be of the form WEB.TOPIC.field, with WEB.TOPIC containing a TWikiForm definition with a line for field "field". Type could optionally include an access control modifier, e.g.:
      text,16,allowedit=Main.AdminGroup : A text field of size 16, editable by the Admin Group only.
      Clicking on Edit would only edit the text block, keeping the metadata block as a hidden HTML field. One could also edit the metadata block, each field being edited in a way specific to its type.
    • A value (question: do we allow newlines in values? do we provide a way to escape them?)
  • Implementation: ASCII already defines control codes, so why not use them? (We already use newline :-) ) See http://www.cs.tut.fi/~jkorpela/chars/c0.html. Possible uses (see the sketch after this list):
    • ctl-] (GROUP SEPARATOR) to separate the metadata block from the text, ctl-^ (RECORD SEPARATOR) to separate each meta record, etc...
    • Simpler to parse: start and end the MD block (at the head of the file) with any special control char we agree on, for instance ^L (this way, plain text files would be recognized easily, and parsing metadata would be fast even for huge files).
    • Have each metadata line prefixed by a special control char (e.g. ctl-^). This would be quite easy to parse grep-style.
History has proven that this simple generic property/metadata system is very powerful and flexible (see X Window System properties), especially if we add the ability to have plugins/scripts be triggered on property changes.
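
To make the control-code idea concrete, here is a minimal Perl sketch of the first variant, assuming GROUP SEPARATOR (\x1D) ends the metadata block, RECORD SEPARATOR (\x1E) separates records, and - my own addition, not part of the proposal - UNIT SEPARATOR (\x1F) separates name, type and value within a record:

    use strict;
    use warnings;

    # Split a raw topic file into a metadata hash and a text block.
    # GS (\x1D) terminates the metadata block at the head of the file,
    # RS (\x1E) separates records, US (\x1F) separates name/type/value.
    sub parse_topic {
        my ($raw) = @_;
        my ( $metaBlock, $text ) = split /\x1D/, $raw, 2;
        # No GS at all: a plain text file with no metadata block.
        ( $metaBlock, $text ) = ( '', $raw ) unless defined $text;

        my %meta;
        for my $record ( split /\x1E/, $metaBlock ) {
            next unless length $record;
            # Values may contain newlines; only GS/RS/US are reserved.
            my ( $name, $type, $value ) = split /\x1F/, $record, 3;
            $meta{$name} = { type => $type, value => $value };
        }
        return ( \%meta, $text );
    }

Because the separators can never occur in normal text, scanning for a single record grep-style stays a one-line regular expression.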

Proposed spec 1.2: (SvenDowideit) move metadata into a meta-web that is parallel to the topic web.

  • Codev would have a Codev-meta web
  • would allow us to have the metadata in a database, while the topic text is still in StoreDotPm
    • CN: can be dangerous because of inconsistencies, and the data being in a DB, not in a file
    • SD: ?huh? I don't see the danger, any more than with 2 topics being in separate files.. (it's all in how you use it, to my mind)
  • reduces the namespace meaning of webs :-)
  • implemented like an automatic INCLUDE (sketched below)
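
A minimal sketch of how the meta-web could work, assuming the standard data/WEB/TOPIC.txt layout; metaWebFor, readTopicText and loadTopicWithMeta are illustrative names, not existing TWiki functions:

    use strict;
    use warnings;

    # Map a content web to its shadow meta web: Codev -> Codev-meta.
    sub metaWebFor {
        my ($web) = @_;
        return "$web-meta";
    }

    # Slurp a topic file from the standard data/WEB/TOPIC.txt layout.
    sub readTopicText {
        my ( $web, $topic ) = @_;
        open my $fh, '<', "data/$web/$topic.txt" or return '';
        local $/;
        my $text = <$fh>;
        close $fh;
        return $text;
    }

    # Rendering a topic then behaves like an automatic INCLUDE of its
    # shadow topic from the meta web.
    sub loadTopicWithMeta {
        my ( $web, $topic ) = @_;
        return ( readTopicText( $web, $topic ),
                 readTopicText( metaWebFor($web), $topic ) );
    }

Swapping the meta web into a database later would only mean replacing readTopicText for the meta web; the topic text itself stays in StoreDotPm.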

(Somehow John's 1.3 proposal got mislaid; I added it again)

Proposed spec 1.3: (JohnTalintyre) keep meta data in the same file as the topic text, but allow alternative storage, e.g. in an RDBMS

  • Keep current mechanism as the default offering
  • Design of the meta data code was always intended to allow such an extension.
Note that the Search code needs altering to also allow extension there
  • Consider how to allow new meta data to be added
  • Have hooks to allow post-processing after changes (a sketch follows below)
None of this is very Wiki-like. As stated elsewhere, you get this in addition to Wiki behaviour.
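
For the hooks point above, a minimal sketch of a post-change handler registry, assuming handlers are keyed by meta data type; registerMetaHandler and metaChanged are illustrative names, not an existing TWiki API:

    use strict;
    use warnings;

    my %handlers;    # meta data type => list of callbacks

    # Register a callback to run after a given type of meta data changes.
    sub registerMetaHandler {
        my ( $type, $callback ) = @_;
        push @{ $handlers{$type} }, $callback;
    }

    # Called by the store once a change has been saved.
    sub metaChanged {
        my ( $web, $topic, $type, $old, $new ) = @_;
        $_->( $web, $topic, $old, $new ) for @{ $handlers{$type} || [] };
    }

    # Example: re-index a topic whenever its form FIELD data changes.
    registerMetaHandler( 'FIELD', sub {
        my ( $web, $topic ) = @_;
        print "re-indexing $web.$topic\n";
    } );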

Contributors:
-- ColasNahaboo - 08 Feb 2004
-- SvenDowideit - 08 Feb 2004

Discussions

Let me present the view of a user, not a coder. [...]

Please don't go to database-based solutions! The plain txt files are a big advantage of TWiki!

-- AndrzejGoralczyk - 09 Feb 2004

I believe TWiki should always support simple file system storage. Use of a database would simply be an alternative.

I really don't see why, from the user's point of view, it should matter whether the meta data is stored in the topic or not. (Mind you, good point that form data isn't really meta data. Then again, if you associate the text box with the "topic text", the form data is arguably meta data.) But there are times when the meta data being in the topic is exposed to the user - search results.

-- JohnTalintyre - 09 Feb 2004

John, from a user's perspective it doesn't matter at all where data/metadata is stored. The only argument for embedding it in a text file is .... it's easy to do it that way, because that's the way it's always been done. A simple file system is a database, just a very inefficient one (see ADatabaseCalledTWiki). I'm not actually that bothered about how the metadata is stored on disc; but I am bothered about glaring inefficiencies that keep getting brushed under the carpet. Moving metadata out of topics allows significantly more efficient access to both metadata and plain topic text. If it is done, there is no need to jump through hoops to separate SEARCH and METASEARCH. Code in several places becomes simpler and faster. Content import and export becomes easier. There are just so many advantages!

-- CrawfordCurrie - 10 Feb 2004

Sorry, I MUST protest.

From the user's perspective it makes a big difference whether the data/metadata is stored in a database or in files outside it. Look at the practice: most archiving systems and CMSs face a serious problem: more than 50% of information objects drown in the depths of the database and are almost impossible to retrieve. Try to find an article in Squishdot which was published 2 or 3 months ago. Maybe you will find it within 10 minutes, but the average user - never, or only by accident! A real problem of storage!

Efficiency is a good motivation, but it is the motivation of the developer, not the user. I understand that from the engineering point of view efficiency is a dimension of excellence. But not necessarily for the user. Who is looking for Wikis? People who demand easy and intuitive access to unstructured data, who want to structure this data BY THEMSELVES and have full control over it. Wiki is the only such application on the market. And why do companies prefer TWiki over any other Wiki? Because of the storage outside a database, in an extremely simple txt format. Because they don't trust databases.

I have a lot of experience as a user of Zope, for example, and know some other users. What do they usually do when developing a portal? They install the so-called External Files tool to store information objects. In fact, the metadata is still stored in the database and - in fact - it is always the cause of problems. Real problems of storage!

It is not only a problem of the reliability of databases. It is nasty to say, but the efficiency of databases is attained at the cost of the extreme complexity of the Database Management System, a complexity fully hidden from and not accessible to the user. And - nastily - most CMSs, CRMs etc. based on databases DON'T work how the user wants but only how the developer coded them. Maybe except the most expensive ones, but I doubt it. A really efficient database needs an admin and heavy support, particularly when you want to change something. In contrast, the TWiki user has a sense of the data, and feels safe because she or he can control all the data, and can even read the txt files if something goes wrong.

More generally: the database philosophy is the philosophy of the system controlling the user (programming the user's behavior). It is good and efficient if you have standardized data structures and users performing relatively simple tasks. But that is not the case when you come to Knowledge Management, because knowledge cannot be locked up in a database. Knowledge is a net (a web), not a stack of records. And now the era of services has come, and services are based mostly on knowledge. Give the user full control over the system, give them a sophisticated but simple tool (pliers) for data manipulation, don't hide anything - and you will win in the era of knowledge.

-- AndrzejGoralczyk - 10 Feb 2004

Well, independently of what AndrzejGoralczyk notes above, from our corporate users' point of view, we use TWiki in spite of the fact that it does not have a database, not because of it....

-- ThomasWeigert - 10 Feb 2004

Really? All the people that have sold TWiki to their management like the fact that we don't use another database, and that all content is versioned (that would be horrible to do in a database).

Sample size of around 7 :-)

-- SvenDowideit - 10 Feb 2004

Sven, I can only speak from my own experience. My TWiki supports 3500 users at over 30 locations world-wide. The number one complaint from the community is that the structured information is not in a database that would allow easy access to history, database-style querying, etc., and that therefore TWiki is very slow at everything but the white-board application.

These opinions might be influenced by the fact that we all have access to a large database application and that most user communities have built applications around that database.

For my further deployment of TWiki it would be a great help if I could have all topic data, or at least the meta data, stored in an external database. Incidentally, we evaluated Zope as a CMF and web application delivery platform quite intensively, and its poor integration with an external database was one of the primary reasons for rejection.

-- ThomasWeigert - 11 Feb 2004

Well said Andrzej, I think exactly the same!
-- ColasNahaboo - 11 Feb 2004

As a followup to the above, if I were to redesign the storage access part of TWiki I would put in an abstraction layer which would allow users to choose how to store their data. It would allow a user to store it in the file system or in a SQL database, or whatever they cook up, as long as they provide a mapping. Similarly, the search mechanism would have to go through an API that either calls grep or executes a SQL query, or whatever. (A sketch of such a layer follows below.)

-- ThomasWeigert - 11 Feb 2004
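
A minimal sketch of the abstraction layer Thomas describes, with two interchangeable backends behind one interface; the package and method names are illustrative, not the real StoreDotPm API:

    use strict;
    use warnings;

    # The common interface: anything that can read a topic and search a web.
    package Store;
    sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub readTopic { die 'subclass must implement readTopic' }
    sub searchWeb { die 'subclass must implement searchWeb' }

    # Today's behaviour: plain files, searched grep-style.
    package Store::Files;
    our @ISA = ('Store');
    sub readTopic {
        my ( $self, $web, $topic ) = @_;
        open my $fh, '<', "$self->{dataDir}/$web/$topic.txt" or return '';
        local $/;
        my $text = <$fh>;
        close $fh;
        return $text;
    }
    sub searchWeb {
        my ( $self, $web, $pattern ) = @_;
        opendir my $dh, "$self->{dataDir}/$web" or return ();
        my @topics = map { my $t = $_; $t =~ s/\.txt$//; $t }
                     grep { /\.txt$/ } readdir $dh;
        closedir $dh;
        return grep { $self->readTopic( $web, $_ ) =~ /$pattern/ } @topics;
    }

    # The same interface on top of a database; assumes a DBI handle in
    # $self->{dbh} and a topics(web, topic, text) table.
    package Store::SQL;
    our @ISA = ('Store');
    sub readTopic {
        my ( $self, $web, $topic ) = @_;
        my ($text) = $self->{dbh}->selectrow_array(
            'SELECT text FROM topics WHERE web = ? AND topic = ?',
            undef, $web, $topic );
        return $text;
    }
    sub searchWeb {
        my ( $self, $web, $pattern ) = @_;
        return map { $_->[0] } @{ $self->{dbh}->selectall_arrayref(
            'SELECT topic FROM topics WHERE web = ? AND text LIKE ?',
            undef, $web, "%$pattern%" ) };
    }

    package main;
    # The caller never knows which backend it got.
    my $store = Store::Files->new( dataDir => 'data' );
    print "$_\n" for $store->searchWeb( 'Codev', 'metadata' );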

Let's get something clear here, Andrzej, Colas; the end user, she-who-browses, does not care if data is in a database because the end user does not know how the data is stored. The person who does know, and possibly does care, is the system administrator, the TWiki installer. It is their case that I think you are arguing, not the end users'.

Having said that, I have encountered exactly what Thomas reports, both at Motorola (a completely non-intersecting user base to Thomas', BTW) and at other companies. End users and administrators have expressed expectations about what they should be able to do with forms data, and are surprised and unhappy when you reveal the facts; they immediately perceive a scalability problem.

The arguments presented above against use of a database to store metadata are specious; they are circumstantial, and are symptomatic of poor implementations, not of the concept. For a counter example, look at TikiWiki, which seems to do quite nicely, thank you.

As is often the case with this sort of question, it comes down to religion. I don't actually like DBMS's (a.k.a "The Black Arts") very much either. But what I dislike more is being locked into one and only one specific implementation of the data store, and a massively inefficient one at that.

If a well-specced Store abstraction layer existed, this discussion would not be happening. We could simply code both approaches and see which one worked best.

-- CrawfordCurrie - 11 Feb 2004

OK, sorry for being a little bit too emotional yesterday; my article above reads too much like propaganda. I'll try to clarify my point of view a little bit more.

You are right that the end user does not necessarily need to know how metadata is stored - if you imagine that the end user is a person simply typing in data and searching for documents (= generating reports). But such a "theory of the end user" is less and less valid, especially in the case of knowledge workers.

According to this experience, the metadata section is perceived in two ways: traditionally, as a kind of label attached to the document, and in a new way - as a section describing the relations of the document to other documents and to the broader context of a set of documents. In the first case you in fact have two information objects: a document and a label. In the second case you have a document with a DOM structure, and - in the most radical cases - the user wants to dedicate particular DOM sections to specific content/functions/tasks/kinds of relations.

Look at the structure of a Wiki document. It contains a header, a body divided into sections, a footer, navigation bars and "metadata"... in fact it is similar to a database record. Of course some sections, such as the footer and header, change rarely, and it makes sense to hide them and prevent users from manipulating them. The other sections, such as the body and the "metadata", can change from document to document, and therefore there is a big temptation to put them in a database. In fact, such a solution has big advantages, and I understand that some users are seeking database functionality. They need scalability in the number of documents processed. But other users want a single consistent document, not one composed of database fields, with full control over the contents - they need scalability of a document!

Let's go to the "metadata" section. I write "metadata" in quotes in order to stress that this is metadata of the topic (the content), not metadata of the HTML representation of the topic (you don't see this metadata in the HTML source). Nor is it part of the structure of the file. It is solely a separate part of the content. If you standardize it and store it in a database you gain efficiency and lose the user's ability to scale it.

An example from Zope. They are proud of having a Dublin Core description attached to every information object. But it is only part of DC, and doesn't contain a dc:type field. So I - the user - am not able to select all white papers (theses, working reports and so on) from a particular folder. I can try to include dc:white_paper in the keywords, and select according to this keyword. OK, but when the user displays any white paper in the portal, the "See also" slot displays not only the documents on the same subject but all the white papers! Nonsense. And it is not a problem of poor implementation - it is the flawed concept of keeping metadata out of the control of the users, in an inflexible database!

OK, developers can handle this problem and add some tools for defining extra metadata fields. But these tools can only generate yet another attachment to the information object. The level of complexity rises to its zenith if you want to rename the object, move it, etc. OK, let's build a tool for annotations, then a separate annotation server... the straight way to nowhere. And how many attachments to a document still make sense?

In contrast - we have one document with some DOM sections. The document already contains a simple and flexible DOM structure which can be used as a database structure.

Moreover, probably the most important change in users' requirements is the move to fine granularity: treating a section of a document as an information object, not the whole document. There is growing demand for concepts like implicitly named sections; see also my SearchingSectionsOfTheTopics. But this is another, albeit related, story...

Conclusion: It is a very good idea to "code both approaches and see which one works best", as you write, Thomas and Crawford. I expect both :-)

-- AndrzejGoralczyk - 11 Feb 2004

In a sense, both approaches have already been coded. One is the way it is done today, and the other .... well, I enjoyed your deconstruction of a TWiki topic, above, because that is precisely what is done in the FormQueryPlugin. In the FQP a cache database is built - with the topic text as one of the fields in a topic record. Because of the data-driven nature of TWiki forms, only a small number of fields in a topic are "formally modelled" in this database - specifically, parent, topic text and topic info. All the other fields are driven from the content of the %META in the topic, and cross-referenced against the schema (forms templates).

I'm not claiming the FQP code is directly re-usable - it isn't, because the FQP extracts other data as well, such as embedded tables and inferred relations to other topics, and it jumps through hoops to be compatible with current TWiki - but it serves quite well to demonstrate the feasibility of such an approach (a sketch of the cache schema follows below). And much of the code - including all the SQL processing - would be directly re-usable.

But the bottom line is that testing the approach properly requires what Thomas calls for above - a well-specced Store interface. Until we have that, the rest of the implementation is wishful thinking.

-- CrawfordCurrie - 11 Feb 2004
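
To give a flavour of the FQP-style cache described above, here is a minimal sketch of such a schema, assuming SQLite via DBI; the table and column names are my own illustration, not the plugin's actual layout:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=cache.db', '', '',
        { RaiseError => 1 } );

    # The formally modelled fields: parent, topic info, and the topic
    # text itself as just another column.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS topics (
            web    TEXT NOT NULL,
            topic  TEXT NOT NULL,
            parent TEXT,
            author TEXT,
            date   INTEGER,
            text   TEXT,
            PRIMARY KEY (web, topic)
        )
    });

    # Everything else is driven by the %META:FIELD lines in each topic,
    # cross-referenced against the form template, so it stays data-driven.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS fields (
            web   TEXT NOT NULL,
            topic TEXT NOT NULL,
            name  TEXT NOT NULL,
            value TEXT
        )
    });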

Metadata, the Mac, and you

-- WillNorris - 29 Apr 2005
