TWiki Roadmap - The essential questions
A debate topic to trigger thought in the community. The actual roadmap is maintained at TWikiRoadMap. This topic has become a brainstorm about how to implement the most important roadmap items in 5.0.
Is Freetown 4.2 or 5.0? This topic is targeting 5.0. The purpose of this topic is to raise some questions to consider for TWiki 5.0.0 and to offer some ideas for how to answer them.
KennethLavrsen will now list a number of opinions that have a tendency to cause passionate debates that rarely lead to results.
Storage. Database or flat files?
- TWiki should be based on flat files because
- This is how it has always been.
- This is easy to understand and manipulate with plugins and external programs
- Easy to upgrade.
- Easy to move things around between TWikis and between webs
- Easy to backup. A database often has to be stopped, or its contents dumped to a flat file, for a safe backup.
- Easy to repair. A broken database is difficult to get working again.
- Hacker friendly
- TWiki should be in a database or other intelligent storage
- Searching could be faster (done by database engine)
- A database indexes the content to speed up searching.
- TWiki would have simpler storage, and would probably be much faster, with many records instead of many flat files.
- Meta data can be separated from content and optimized for speed
- Form data can operate like a real database because it is a database.
- We can interact much better with other databases
- A TWiki can interact with another TWiki through the database interface and make synced TWiki installations much easier to implement.
- I can load-share three TWikis with one database server for a high-traffic site, for example.
Who is right? BOTH!
Which one to pick? Neither and both! Read on. We will come back to this.
Meta data with topic content or outside
- Keep it inside because
- The meta data follows the topic. If I move topic files around the meta data follows the topic. Easy and simple.
- When versions are saved in the revision control system all the meta are in the history giving a perfect audit trail
- Keep it outside because
- It is inefficient to search meta data by reading all files.
- Settings in other topics are slow to fetch and slow down TWiki
- Access rights require reading text files all the time, slowing down TWiki.
Who is right? BOTH!
Which one to pick? Neither and both. Read on. We will come back to this.
CPAN libs. Use them or avoid them?
- Use them everywhere we can because...
- CPAN libs are normally well tested, especially when you use those that have been around for years.
- CPAN libs are free open source and we can contribute if there are problems.
- CPAN libs save a lot of coding, letting the small community do more in less time.
- Our own homemade code has more bugs.
- Avoid CPAN libs because...
- It is a pain to install them
- It is a pain to have compatible versions of the libs
- Some of the libs can be buggy and we do not know which ones the customers have installed
- Many non-root installers cannot install missing libraries
- Many shared host providers lack some of the CPAN libs that we need.
- Many CPAN libs contain more code than we use, but they are still compiled, and their dependencies are sometimes also compiled even when not needed. This slows down TWiki.
Who is right? BOTH!
Which one to pick? Neither and both. Read on. We will come back to this.
Should TWiki be backwards compatible with all old versions and only add new syntax?
- Yes
- If we do not maintain this principle, no one can ever upgrade large TWiki installations. The value for customers is in the content, not the TWiki.
- If we have not maintained compatibility in the past, people will know that we as a community will not maintain it in the future. Then we have no customers, and quickly no developers either, because most developers are also customers. The developers that are consultants with no professional use of TWiki are usually the ones that do not care about compatibility.
- No
- If we always have to be backwards compatible we can never make TWiki better in the basic features.
- If we always have to be backwards compatible we can never make TWiki faster.
- We can convert topics with upgrade scripts.
Who is right? BOTH!
Which one to pick? Both. I will explain what I mean later.
And now we come back to the questions.
So what is the message?
- If we always make these essential issues a matter of black and white then we
- either destroy the project and in reality fork out a new one from scratch
- or we stall the project, it kills itself through bad performance, and competitors kill us with their much better products.
When you look at the big issues above, no matter what we pick, TWiki is doomed!
Do we give up? No way. The solution is always hidden in the 3rd option. And sometimes the 4th.
We need to dig into the requirements and find solutions that fulfill them all. And the requirements should never be the technical requirements. They should be the fundamental requirements seen from the customers' perspective and the developers' perspective in combination. Customers (we call them that even if TWiki is free) are admins, installers and users.
You cannot easily give full answers to all these huge questions. But the hope is that we can trigger some of those 3rd and 4th options.
Storage - The non-technical requirements
- Customers want to be able to easily move topics from one TWiki to another: from an older version to a newer version, from downloaded zip files. And they want to be able to do it in a simple way, like copying files.
- Customers want to be able to easily hack the topics outside TWiki, by hand or with other programs.
- Customers want to be able to repair broken topics
- Customers want to use databases with TWiki, and to use TWiki more like a database in TWiki Applications
- Customers want a much faster TWiki that does not slow down when they have 10000s of topics
- Customers especially want searching to be faster, because this is really slow today
- Customers want all their TWiki applications that use complex SEARCHes to keep working when they upgrade TWiki.
- Customers want searches that search the awful META data based forms to keep working when they upgrade TWiki.
- Customers do not want to rewrite existing topics and TWiki applications.
Can we meet these requirements? Why not? What if TWiki stores its information in a flat file AND in a database? All searching happens in the database. All viewed topics are shown from the contents in the database. Saving saves to the database and to the flat file - or maybe only saves to the flat file on demand. If you throw in a new flat file, TWiki can pick it up and put it in the database on demand. If the import and export from flat files is only done on demand, then TWiki would not have to carry the load of the flat files under normal operation. And it cannot be a big deal to convert data back and forth.
Storage may also still be implemented using flat files combined with other indexing techniques, so that all searching happens using the index. META data like access rights and form data may be in a database format. And you can still continue to write forms and access rights in the flat text files to maintain the audit trail, while all the reading of forms and access rights happens from elsewhere. Nothing prevents combining the best from both worlds.
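As an illustration of this dual-storage idea, here is a minimal sketch in Python with SQLite (not TWiki code; all function, table and directory names are made up for this example): saving writes both the hackable flat .txt file and a database row, searching only hits the database, and a rebuild function re-imports flat files on demand.

```python
import os
import sqlite3

def make_store(dbpath):
    # The database is a rebuildable mirror of the flat files.
    db = sqlite3.connect(dbpath)
    db.execute("CREATE TABLE IF NOT EXISTS topics "
               "(web TEXT, name TEXT, text TEXT, PRIMARY KEY (web, name))")
    return db

def save_topic(db, datadir, web, name, text):
    # Write the hackable flat file...
    webdir = os.path.join(datadir, web)
    os.makedirs(webdir, exist_ok=True)
    with open(os.path.join(webdir, name + ".txt"), "w") as f:
        f.write(text)
    # ...and mirror it into the database that all searching uses.
    db.execute("INSERT OR REPLACE INTO topics VALUES (?, ?, ?)",
               (web, name, text))
    db.commit()

def rebuild_from_files(db, datadir, web):
    # On-demand import: throw in new .txt files, then rebuild the DB.
    webdir = os.path.join(datadir, web)
    for fn in os.listdir(webdir):
        if fn.endswith(".txt"):
            with open(os.path.join(webdir, fn)) as f:
                db.execute("INSERT OR REPLACE INTO topics VALUES (?, ?, ?)",
                           (web, fn[:-4], f.read()))
    db.commit()

def search(db, web, word):
    # All searching happens in the database, not by reading files.
    cur = db.execute("SELECT name FROM topics WHERE web = ? AND text LIKE ?",
                     (web, "%" + word + "%"))
    return [row[0] for row in cur]
```

The point of the sketch is the shape, not the schema: the flat file stays the safe, hackable copy, while the database can be deleted and rebuilt at any time.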
Meta data - the requirements
Customers want a full audit trail of all changes, including form content, dates, authors, access rights etc. If soft security is to work, this is essential for a wiki. In a database we can maintain this audit trail. And if we export to flat files we can use good old rcs to pack the entire history in a ,v file. And naturally we can import these as well. Inside a database, customers do not care how the data is organized. It would actually be nice to search in older revisions: one often lacks the ability to search one topic's older versions by date and generate a table of a value in a form over time. Maybe with the right storage model for both meta data and revisions such a thing could become feasible. Today it would be terrible with the flat-file ,v rcs files.
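That "value in a form over time" query becomes trivial once revisions are rows rather than ,v deltas. A hypothetical sketch (Python with SQLite; the revisions schema is invented purely for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: every saved revision keeps its form fields,
# so older versions stay searchable instead of being buried in ,v files.
db.execute("""CREATE TABLE revisions
              (topic TEXT, rev INTEGER, date TEXT, field TEXT, value TEXT)""")
rows = [
    ("BugReport42", 1, "2007-01-10", "Status", "New"),
    ("BugReport42", 2, "2007-02-01", "Status", "Assigned"),
    ("BugReport42", 3, "2007-03-15", "Status", "Closed"),
]
db.executemany("INSERT INTO revisions VALUES (?, ?, ?, ?, ?)", rows)

def field_history(db, topic, field):
    # "A table of a value in a form over time" for one topic.
    cur = db.execute("""SELECT date, value FROM revisions
                        WHERE topic = ? AND field = ? ORDER BY date""",
                     (topic, field))
    return cur.fetchall()
```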
CPAN libs - the requirements
Developers want reuse of quality code. Customers want quality code. That is a common goal. Customers want to be able to install TWiki. They expect to download a "setup.exe", double click on it and BANG it is installed. Windows programs have done this for years.
For CPAN libs to be accepted we need to find methods for installing TWiki in an easy way, including installing dependencies as non-root. An obvious suggestion is to distribute all the CPAN libs TWiki uses. That is most of a Perl distribution. Not feasible. Then we can provide all the non-standard CPAN libs. Better. But there can still be compatibility issues with the standard libs. And some CPAN libs are partly binary and will only work on all platforms if compiled from source. Many shared hosts and aggressive UNIX admins do not give access to a compiler. So the solution here is probably a combination of the possibilities: a better installer which can also install CPAN libs, some libs in the distribution, some binary CPAN modules in a TWiki contrib compiled for a couple of the most common CPU/OS platforms. Linux/GCC/glibc/i386 will cover 90%.
Another possibility is to provide TWiki as rpm and deb packages and pick the right palette of rpms/debs for it. I have personally installed all my Perl CPAN libs from Dag's RPM repository for my CentOS 4.4. They are all there. And it works flawlessly. We can really do a lot with very little if we want to.
The basic requirement is that people who know nothing about CPAN, nothing about Apache configuration, nothing about much at all... can install TWiki. And there is probably not ONE solution that fits them all.
Compatibility - the requirements
Compatibility is essentially the requirement to ensure that the millions of staff hours our customers have spent on content is not lost if you upgrade. AND. The requirement is that it should be possible to upgrade. Not upgrading is not an option.
This is not an impossible task. Others have done it before us. But it is extra work.
Making silly upgrade scripts that search and replace inside the topics are no good. They simply will not work. Especially advanced searches will break. Many searches search for syntax and often the search pattern is dynamically generated.
Some of the worst hacks people have made may have to be sacrificed, but the general rule must be that what we change is backwards compatible, not forward compatible. I.e. new topics need not work in old TWikis.
This is make or break.
But we can convert old to new on the fly. We do not want people to sit with manual scripts. But if we look at my thought about changing storage to a fast database approach in combination with import and export of flat files, then this can be the conversion method as well. When you read an old topic, you convert it to the internal database storage format. When you export, the format is the new one.
The basic requirement is that I can take all my topics - be it one or one million - from my old TWiki, upgrade the TWiki, and all my old data will be imported into the new TWiki and work, including my TWiki applications.
The largest challenge here is for sure searching. For example, old Cairo apps will be searching directly in META when looking for form values. We would need to store the form data in an additional META syntax so that searches can find the data. Or we can make the search very smart. The requirement is not the technical solution. It is that my TWiki Applications continue to work so I do not have to spend staff years manually rewriting searches and other advanced TWiki TML in 10000s of topics. How this is done is implementation.
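The on-the-fly conversion when reading an old topic could look something like this sketch (Python for illustration; the %META:FIELD line format is TWiki's, but the function and the internal "text plus fields dict" representation are assumptions):

```python
import re

# Matches TWiki meta lines such as:
#   %META:FIELD{name="TopicClassification" title="..." value="NextGeneration"}%
META_FIELD = re.compile(r'%META:FIELD\{(.*?)\}%')
ATTR = re.compile(r'(\w+)="([^"]*)"')

def import_topic(text):
    """On read, convert an old flat-file topic into an internal
    structured form: the plain text plus a dict of form fields."""
    fields = {}
    body_lines = []
    for line in text.splitlines():
        m = META_FIELD.search(line)
        if m:
            attrs = dict(ATTR.findall(m.group(1)))
            if "name" in attrs:
                fields[attrs["name"]] = attrs.get("value", "")
        else:
            body_lines.append(line)
    return "\n".join(body_lines), fields
```

Because the conversion happens at read time, throwing old topics into a new installation needs no manual upgrade script.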
Last word
The purpose of this long topic was to start some thoughts for a TWiki released some time in 2008.
I really encourage the community not to pursue yet another TWiki release with 50 small hacks and minor improvements.
We need to spend a long time architecting these essential basic designs if we want TWiki to become
- Scalable = faster
- Maintainable = more reuse of code and storage architecture
- Better = better features instead of just more features which requires attacking the compatibility and on-the-fly upgrading problem.
- Fun = take the time to be innovative, bring up many many proposals for implementations and choose the best, combine them and implement them. Fun is also the basic requirement for expanding the community
- Easy ... to install and to use
I believe we should take at least 3, better 6 months to architect TWiki-5 and maybe code some concept code that may never be part of TWiki to verify performance etc. And for those that just want to code and code there are many plugins that could need a good touch up and there is the Apps web where I personally will spend a lot of time. And on the core there will always be a bug to fix for patch releases. There is plenty to do. But if we do not take the time to really architect the good design then I predict TWiki will be history sooner than we would like.
--
Contributors: KennethLavrsen - 24 Jan 2007
Discussion
Yet another great read, thanks Kenneth. One thought: don't count on architectural design (documents), do it agile. You're absolutely right about the fun part!
--
FranzJosefSilli - 25 Jan 2007
Having worked with, developed with, upgraded, and converted between (flat files and) various database-backed systems such as other wikis, CMSes and blogs, I can attest to the advantages of the latter. The arguments against it just don't hold water.
Yes, conversion is always an issue. But once again it's a matter of delivering a professional and tested product.
I just want to see if you can do this in Perl before I get in done in my spare time using Ruby-on-Rails.
--
AntonAylward - 25 Jan 2007
Wonderful!
I skipped down here after I saw the BOTH answer for flat files and DB. I have implemented such a solution for our intranet TWiki. The performance is much faster. The TWiki .txt files are still created, and the contents are also loaded into a DB. The DB is implemented in SQLite (for simplicity). Text contents is separate from meta data, which is separate from form data. Topics are always fetched from the DB (which, using mod_perl, is always open). The .txt files are written for backups. Using a DB means we can now do real SELECTs for searching, ORDER BYs, and full-text indexing. It is very exciting that development is considering this path.
--
CraigMeyer - 26 Jan 2007
"I have implemented such a solution for our intranet TWiki". You have code you can / will share?
--
KennethLavrsen - 26 Jan 2007
Sure, it's not complete . . . yet

(Note: I am writing this at home, from memory, the code is at work.)
. . . edited for highlights . . .
- Implemented as a plugin replacement for %SEARCH%
- Uses CPAN's DBIx::TextIndex for full-text indexing
- Schema: META & FORM data in fields
- Cons:
- SQLite does not support regexps (except for equivalents of '.' and '.*').
- Topic names are searched as "%word%" rather than an exact match.
- SQLite does not support case-sensitive search; other DBs do.
- Pluses:
- Does Modified (i.e. Changes) search w/o reading all files, as the last-update time is stored in the DB
- Extended %SEARCH{}% with an author="..." parameter (so we can do Changes by a given author or authors)
- Does phrase distance searches (i.e. "Perl Java"~4 - meaning Perl & Java within a 4 word distance)
- SQL parser in C
- Future ideas: (ex: SQL selects, ordering of results, storing result lists into variables, separating search from the formatting of results)
--
CraigMeyer - 26 Jan 2007
Further discussions should be done in a separate topic. Mind: there has already been much work done on that road by Thomas and Michael via the DBCacheContrib resp. DBCachePlugin, derived from Crawford's FormQueryPlugin. Craig should put his work in SVN.
--
FranzJosefSilli - 26 Jan 2007
Craig, as Franz Josef said, it would be great if you could put your work somewhere where we could take a look at it (just zip it up and attach it to some topic, e.g., here). There are a number of us who have been playing around periodically with different types of backends or leveraging some type of database in a transparent way. It would be great to study your strategy for inspiration.... (Just FYI, I am using YetAnotherFormQueryPlugin and YetAnotherDBCacheContrib extensively in almost all my TWiki applications, which also have replacements for search. This does regex, exact match, case sensitivity, but does not do distance search.)
--
ThomasWeigert - 26 Jan 2007
Thanks for your interest in the code

I read the code, and in light of the discussion here, see it more as a prototype for combining the best of DBs and flat files. Most of the effort was spent fitting it into the existing Search.pm design, to evaluate the speed of using a DB - both for doing the search, and for fetching the hits for formatting. This resulted in a much faster search. Though all META was stored, only TOPICINFO, the topic name, and the contents itself were searchable through the Search.pm interface.
Following Ken's suggestion of forward compatibility, I will extract the good bits and think about how a SEARCH "should" be structured. My suggestion would be a more SQL feel in the interface, so maybe a %SQLSEARCH{}%. A more user-friendly search plugin could then be written to bridge the gap between the existing search and this.
Separating the search from the formatting of the results would allow for other uses of search (counting, use in HTML select drop-downs, etc).
By using a DB (SQLite, MySQL, Oracle, ???) we can benefit from its speed, rather than doing a lot of DB-like stuff in Perl.
...sorry, interrupted, I will add more later.
--
CraigMeyer - 28 Jan 2007
Asking for the code was just to get an idea, Craig.
The concern with using a database storage is that the most powerful feature we have in TWiki are the formatted searches where you can regex both what you look for, the header of the result and contents from the topics or single lines that you searched through.
The syntax is terrible. But then Regex is terrible but probably one of the most powerful things ever invented.
Losing the formatted search and only having the much more limited search that standard SQL gives you is the same as reducing TWiki to yet-another-wiki. So we need to somehow overcome this, either by creating our own storage format optimized for TWiki or by a combination of different schemes.
A super fast SQLSEARCH that lives side by side with the slower but more powerful flat-file SEARCH is an option. But as I said above - it is often the 3rd option that no one has been thinking about that is the best solution.
--
KennethLavrsen - 28 Jan 2007
Craig, as Ken said, please don't be too worried about the state of the code. The point is to have people who have thought about this topic but got stuck for one reason or another stare at it and get thinking.... the "many eyeballs" syndrome....
--
ThomasWeigert - 28 Jan 2007
Ken - I was also missing the power of regexps with the SQLite search. I noticed that SQLite does provide a hook for REGEXP. So, technically we could implement a C regexp, which would be linked in and then used from Perl, using PCRE. After looking at the SQLite src, it may be a straightforward thing to implement; I will experiment this week at work.
At worst, the DB could be queried in a tight loop to implement regexp. It would still be faster than reading all the files.
Something like (pseudo-code):
foreach my $row ( $pdb->select_array("select TOPICNAME, DATA from Topics where Web = 'TWiki'") ) {
    my ( $name, $text ) = @$row;
    if ( $text =~ m/$regexp/ ) {
        push @hits, $name;
    }
}
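For comparison, SQLite's REGEXP hook can be registered from any host language that supports user-defined SQL functions; here is a sketch using Python's built-in sqlite3 module (in Perl the equivalent would go through DBD::SQLite; the table and data below are invented):

```python
import re
import sqlite3

db = sqlite3.connect(":memory:")
# SQLite has no built-in REGEXP implementation: the "X REGEXP Y"
# operator calls a user-defined function regexp(pattern, value)
# if one has been registered.
db.create_function("regexp", 2,
                   lambda pat, val: re.search(pat, val) is not None)

db.execute("CREATE TABLE Topics (TOPICNAME TEXT, DATA TEXT)")
db.executemany("INSERT INTO Topics VALUES (?, ?)",
               [("WebHome", "Welcome to the web"),
                ("WebSearch", "regexp|match|search live here")])

def regexp_search(db, pattern):
    # The regex runs inside the query, not in a loop over files.
    cur = db.execute("SELECT TOPICNAME FROM Topics WHERE DATA REGEXP ?",
                     (pattern,))
    return [r[0] for r in cur]
```

This is the "tight loop" pushed down into the engine: the hook gives full regex power while still reading rows instead of files.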
I will post the Schema (or really the SQlite init code), along with the code which builds the initial DB, once I get to work.
An example future search:
%SQLSEARCH{query="DATA REGEXP '/regexp|match|search/i' and AUTHOR like 'Ken%' and FIELDNAME = 'TopicClassification' and FIELDVALUE = 'NextGeneration'" orderby="LAST_UPDATE" format="....." }%
would display all topics which mention searching (ignoring case), with authors starting with 'Ken', and classified as NextGeneration. (Note: I did NOT implement this syntax!)
--
CraigMeyer - 28 Jan 2007
A few words on "should TWiki use a database backend yes-no". There are only a few rare cases where you don't need a database. As a CMS, TWiki is not among those cases. Au contraire, it is more astonishing that we still don't have a database backend and got away without one so far. There's not much more to say about this. A database backend is the "necessary evil" you simply have to accept.
However, applications do not necessarily regard the database as the "authoritative location" for the documents it contains. Databases are good at indexing, caching and processing search queries. The fact that the data itself comes with the query result is just a convenient coincidence. It is not so seldom that the documents that get stored and indexed in a database are available in isolation as well, and that these can be used to populate the database anew. We have seen these techniques being explored in an array of extensions to TWiki already (DBCacheContrib, FormQueryPlugin, DBCachePlugin, XmlQueryPlugin). And I think it is about time to integrate this kind of technique (not necessarily exactly these) into the core.
--
MichaelDaum - 27 Jan 2008
Congratulations Kenneth and everyone that has contributed to this discussion!
I think that memory usage is forgotten, and it's a relevant point to consider with respect to Performance & Scalability. In particular, I'll work on this issue to verify whether good improvements can be made. I think so because much of the data used for each response is common to all of them and accessed read-only, so it could be shared, which could increase server response capacity with the same amount of RAM.
--
GilmarSantosJr - 11 Mar 2008
I can't help feeling that this database-versus-flat-files argument is missing the essential point.
Historically the main block to moving to a database-backed store has been the extensive use by end-user applications of the %SEARCH function to search meta data. This problem has been worked around by DBCacheContrib and other work such as that of CraigMeyer. The idea has been to separate structured searches from text searches, but the success of this approach is fundamentally limited by the cooperation of the end users in porting their unstructured %SEARCHes to structured %DBQUERYs.
The holy grail is an approach that accelerates the existing %SEARCH function without requiring rewrites of the end user %SEARCHes. Even if you cache the flat-text DB in a database - or move to a DB store and cache topics in flat text files - you will still achieve nothing as long as %SEARCH is forced to search meta data using regular expressions. This is because it is possible to map structured queries to regexes (I did this for type="query" searches), but the reverse, mapping regexes to structured queries, is not feasible (I'd be glad to be proved wrong).
So, the current compromise. If you want fast data access, you have to install something like the DBCacheContrib (and one of its UIs, either DBCachePlugin or FormQueryPlugin), and write queries compatible with the structured view they provide. The DBCacheContrib has been written to be 100% compatible with the existing TWiki main database. Unlike Michael I do not advocate including dbcache-style support in the standard release, because it can be tricky to install (because of dependencies on external databases), and is not required by most end users. I supported type="query" searches in TWiki 4.2 with the goal of providing a standard framework into which extension implementors could plug their database solutions without having to invent new syntax. I have a personal goal of porting the DBCacheContrib to support this new standard. We should be in the position that no one has to use a database, but if they do, then their queries get accelerated.
What else can happen in the future? Well, someone could work out how to accelerate regexp searching of a flat text database to the extent that such external database-based query accelerators are redundant. There are a few PhDs in that. Possibly a Nobel prize, too.
The second key question asked above was "to CPAN or not to CPAN". I'd like to reinforce Kenneth's points above. Most CPAN modules are installable by non-privileged users; the key for TWiki is to support local installations of CPAN modules. TWiki is a compromise; it uses Perl because Perl is fast to write in, not because it's fast. Similarly we use CPAN to avoid re-inventing the wheel, not because it's particularly great. Better to work out how to play with it nicely than to throw the baby out with the bath water.
So, for TWiki 5.0, I personally believe the focus should be on communication - ensuring users have access to quality documentation, training materials, and expertise that shows them how to make the most out of the powerful solutions we already have. An hour spent gardening in Codev is worth 5 spent debating core architecture.
--
CrawfordCurrie - 15 Mar 2008
It is a fact that we have a performance issue.
Addressing communication does not address the performance issue.
We need a storage scheme that enables fast query searches. The regex searches that plow through meta are slow and I have no expectation of ever seeing this change. But the roadmap discussions we had clearly showed that many of us see a need to have our TWiki applications run fast even as our number of topics goes from 1000s to 10000s and 100000s.
The way I believe we should start is by not overdoing things too much.
- Keeping our flat files has a lot of advantages
- We customers feel safe when files are in a hackable format. I can fix things. I can move files around easily. I can script things, etc.
- Easier to migrate, both for coders and customers.
- The audit trail is given to us for free, and no one needs more audit trail than what we already have.
- Whatever database storage we make, you should be able to delete the whole database and TWiki will rebuild it
- If you hack/copy/delete flat files you should be able to manually trigger a rebuild of the database.
- The databases should, in release 5.0, be as simple as possible. I see this as a good starting point:
- All access rights are put in a database every time you save a topic. No more running through .txt files to check for access rights.
- All form data should be in tables in a database. This must be simple to do. Form data are simple field-name/field-value data sets.
- All other topic meta data in the database. Again simple, because it is already structured data with a field name and a value or set of values.
- Either in 5.0 or 5.1 depending on resources the first step of topic object model would be to just put tables in a database format. Most of my TWiki applications that relate to searches presenting data from many topics either query the form values or do nightmare regexes to dig out data from tables. Having tables in database storage combined with query search will kick major ass. I accept that I have to rewrite all regex searches to query searches to gain any performance.
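To give a feel for how simple such per-save indexing could be, here is a sketch in Python with SQLite (the table names, the index, and the may_view rule are all assumptions for illustration, not a proposed design):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical "as simple as possible" schema: rebuildable at any
# time from the flat files, so deleting it loses nothing.
db.executescript("""
CREATE TABLE form_fields (topic TEXT, name TEXT, value TEXT);
CREATE TABLE acls (topic TEXT, rule TEXT, who TEXT);
CREATE INDEX ff_idx ON form_fields (name, value);
""")

def index_topic(db, topic, fields, acls):
    # Called on every save: refresh this topic's rows.
    db.execute("DELETE FROM form_fields WHERE topic = ?", (topic,))
    db.execute("DELETE FROM acls WHERE topic = ?", (topic,))
    db.executemany("INSERT INTO form_fields VALUES (?, ?, ?)",
                   [(topic, n, v) for n, v in fields.items()])
    db.executemany("INSERT INTO acls VALUES (?, ?, ?)",
                   [(topic, rule, who) for rule, who in acls])
    db.commit()

def topics_with(db, field, value):
    # An indexed lookup instead of a regex over every .txt file.
    cur = db.execute(
        "SELECT topic FROM form_fields WHERE name = ? AND value = ?",
        (field, value))
    return [r[0] for r in cur]

def may_view(db, topic, user):
    # No more grepping .txt files for ALLOWTOPICVIEW on every search.
    cur = db.execute(
        "SELECT who FROM acls WHERE topic = ? AND rule = 'ALLOWTOPICVIEW'",
        (topic,))
    allowed = [r[0] for r in cur]
    return not allowed or user in allowed
```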
I think we should implement a default DB storage that does not require an external database resource like MySQL or Oracle, and make this pluggable so you can use an external database engine instead of the built-in one. And then let plugin writers write the Oracle, MySQL, PostgreSQL etc interfaces. The essential detail is that our internal default database must index the data so that queries even on 1000000 topics happen in subsecond time.
--
KennethLavrsen - 15 Mar 2008
Nice discussion. I also found out that working with file-based databases like SQLite (or simpler - but much faster - systems like qdbm or Tokyo Cabinet; I didn't try textdb) is really an order of magnitude simpler and more robust than databases in separate processes (like MySQL), so we should definitely try to limit ourselves to these "linked library" file databases. On the other hand, for speeding up file-based regexp searches, some leads:
- Use grep -r (with includes/excludes) to let grep do the work of selecting files. We would require GNU grep.
- Organise data/ so there is only the .txt file there, so you can use grep -r more efficiently (no RCS, .changes, .lck, etc. files there)
- Use a simple caching strategy: build in each web an index that contains all the topics as one line per file (replacing newlines by \r for instance). This is used by some simple web search engines, and is surprisingly fast, especially with grep --mmap
--
ColasNahaboo - 15 Mar 2008
Kenneth, we have already done a lot of work to create exactly the sort of pluggable store abstraction you describe above. It's already there. Most of your wishlist above is already in place, in the DBCacheContrib, and has been for years. Sure, we could do with some better DB store implementations - I'll be the first to admit the DBCacheContrib was a quick hack - and maybe one of these should be selected to be part of the standard release. But the fact that you may not have realised how much is already in place is part of what makes me ask for better communication. I don't want to see the wheel being re-invented again.
A quick aide-memoire: the DBCacheContrib works by maintaining a cache of all the form fields in TWiki topics in a simple database (TDB). It publishes a query interface - which was the grandfather of the type="query" search - that is used to perform structured data searches. Plugins layered on top - such as DBCachePlugin and FormQueryPlugin - provide additional support for result sets and output formatting of large data tables. It depends on the data, but the DBCacheContrib scales up to ~10X further compared to flat files. It currently uses its own query language, though it can be extended to support the type="query" format.
Note that I'm not advocating using DBCacheContrib in the standard TWiki release, for the reasons I pointed out above. I'm just pointing out that anyone else can sit down and write a better database accelerator module today should they want to.
--
CrawfordCurrie - 15 Mar 2008
The problem with DBCache is that it lives its own life with its own syntax.
The implementation in 5.0 must work with our standard INCLUDE and SEARCH syntax. But if the code behind DBCache can be the foundation then it is great.
We must however make sure we have the right glue (API) that enables alternative database implementations.
So the task could be
- Define the API incl the DB table layout.
- Implement an extension that can be based on the DBCache which will be part of the default set of extensions so that all query type searching uses the indexed database.
Does the DBCache also put the access control lists in the database? Surely an important performance parameter is to avoid scanning for ALLOWTOPICXXXX in maybe 100000 files each time you do a search.
--
KennethLavrsen - 16 Mar 2008
DBCache may have been around for years, but it may be TWiki's best-kept secret. The documentation is way too complex. It needs an explanation in plain English, plus usage examples.
--
ArthurClemens - 16 Mar 2008
DBCache has its own life because (1) the core code is too slow in picking up good ideas, (2) it was needed on legacy TWiki engines right away, and (3) it made it possible to use TWiki as a real application platform.
Give it a try yourself.
--
MichaelDaum - 16 Mar 2008
Can we bring this discussion, as well as the roadmap, to a more concrete level please? We all agree on the strategic direction as far as I can see. The rest of the discussion is too uninformed and might very well be proven irrelevant/not worth it as soon as we get more concrete.
--
MichaelDaum - 17 Mar 2008
Just some thoughts, not too concrete, but still.
- Would it be ok to not be backwards compatible to all prev versions, so customers would have to upgrade to TWiki 4.1 or 4.2 first, then 5.0? I think so, but others might disagree, what is acceptable?
- TWiki's default install shipped with SQLite or an equivalent DB storage, plus import/export functions that can convert data. File-based storage could remain an option, selectable in
configure after the default install; that caters for those who just want a file-based backend.
- TWikiStandAlone opens many doors for trying and testing TWiki. When the
configure script is CGI'ed, no other web server is needed for a default install. And performance would presumably be good?
- Although
%SEARCH{ would not benefit from data being in a database, it could be handled as a separate issue. It is still an issue, so why could it not just continue to be so? If there is a newer syntax, then new users should learn to use that, and webs using the old %SEARCH{ could either continue to use slow searches or be rewritten to use snappier searches. That would keep compatibility but also give new power.
--
LarsEik - 17 Mar 2008
Michael. I was not trying to speak negatively about the DBCache suite. My argument was that 5.0 needs to integrate the DBCache with the standard syntax. We cannot just include the DBCache suite as is. I must admit that I never tried it, because I did not want to add yet another set of search and include syntax to my installation. From what I understand now, it seems this could be a good technical platform to base the 5.0 storage solution on.
Lars. If we go for a storage scheme based on the DBCache work, then there is no compatibility issue with respect to file formats etc. I would be able to throw in topics from a Cairo installation on a 5.0 TWiki, and after a quick rebuild of the DBCache (either by pressing a button or by a script that cron runs periodically) I would be flying.
The SEARCH in 4.2 has been extended with the new type "query". The idea behind this was exactly to prepare for a 5.0 DB storage scheme. The query search is designed to be easy for the user to use, as well as easy to convert into
SQL-like queries.
The plan as we discussed it in mid 2007 has always been that regex searches would not benefit much from DB storage, because no DB engine supports fast regexes. But simple word queries will benefit a lot from searching an indexed representation of the topics. And query-type searches will be flying, with speeds orders of magnitude faster than today when you have a huge number of topics.
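To make the "easy to convert into SQL-like queries" point concrete, here is a minimal sketch using SQLite. The table layout, column names, and the mapping shown are invented for illustration only; they are not TWiki's actual schema or query translator.

```python
import sqlite3

# Hypothetical cache schema: one row per (web, topic, field) form value.
# All table and column names here are illustrative, not TWiki's real layout.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE formfield (web TEXT, topic TEXT, name TEXT, value TEXT)")
con.executemany(
    "INSERT INTO formfield VALUES (?, ?, ?, ?)",
    [("Main", "BugOne", "Priority", "High"),
     ("Main", "BugTwo", "Priority", "Low"),
     ("Main", "BugThree", "Priority", "High")],
)

# A 4.2 query search such as
#   %SEARCH{"Priority='High'" type="query" web="Main"}%
# maps almost directly onto an indexed SQL lookup:
rows = con.execute(
    "SELECT topic FROM formfield WHERE web=? AND name=? AND value=?",
    ("Main", "Priority", "High"),
).fetchall()
print(sorted(t for (t,) in rows))  # ['BugOne', 'BugThree']
```

With an index on (web, name, value), such a lookup avoids opening and parsing any .txt file at search time, which is where the speed-up over grep-style scanning would come from.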
This topic was one I created a year ago, and I guess it came alive again because we needed a loose, open place to try to gather our thoughts about how to do the wonderful things we have on the
TWikiRoadMap.
Naturally we will soon have to create specific proposal topics and modify some that we already have. But it is good sometimes to have an open brainstorm forum to test ideas, and I think the past days have tabled some excellent ideas (such as reusing the work on DBCache).
I am very happy to see interest building around our top items on the roadmap.
--
KennethLavrsen - 17 Mar 2008
I'd advise against reusing DBCache, because it fundamentally does not scale. Lynnwood and I spent quite a bit of time on it for a project with 30,000 to 50,000 topics, using large amounts of formfields, and essentially, it falls over in the same way as SQLite does - when you have to update (or test to check for update) large amounts of data, it gets slow.
The biggest thing that confuses me about this discussion is that it seems (to me) to boil down to two sides of a coin: does TWiki continue to attempt to re-implement a database from scratch, or should TWiki be made to leverage existing databases?
The second is what my work on
DatabaseStore is all about: to enable TWiki to leverage existing technologies, while allowing the user to choose what scale of datastore they need.
Thus it supports both small scale, i.e. TWiki text files, and massive scale, farms of big-iron
SQL servers.
DBCache in essence fails on two counts: it is limited in its ability to cache larger numbers of topics, and it has its own query language.
As a stopgap, it is a brilliant proof that TWiki can be much better than it is today, but it is not 'the answer'.
--
SvenDowideit - 18 Mar 2008
Sven, the term "DBCache" has not been used here to say "let's move the Perl code of
DBCacheContrib to TWiki's core". No, it means "let's cache txt files in some database and let it index, cache, process, and optimize queries for us". The idea behind "DBCache" is that the "authoritative location" of content remains in txt files and that a database is only used for what it is best at.
See my comments above.
Note that the "DBCache" idea does not mean that you create a
DatabaseStore, as this would mean it becomes the new authoritative location of the content. "DBCache" means that you keep track of changes in the txt files, but never directly change the DBCache without a corresponding change in the txt files. "DBCache" means that you can always bootstrap a database-powered cache from the txt files.
Again: "DBCache" does not necessarily mean using any of the code in
DBCacheContrib, for the reasons you mentioned, i.e. lack of scalability. There is no reason per se to assume that any possible implementation of "DBCache" would not work out and scale properly.
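The bootstrap-and-sync idea described here, with the txt files staying authoritative, could be sketched roughly as follows. This is a hedged illustration only: the directory layout, table schema, and function name are all invented, and real TWiki topics would of course need proper parsing rather than being stored as raw text.

```python
import os
import sqlite3

def sync_cache(data_dir, con):
    """Bring an SQLite cache in line with the authoritative .txt files.

    The txt files are never modified here; the cache can always be
    deleted and rebuilt from them by calling this on an empty database.
    """
    con.execute("CREATE TABLE IF NOT EXISTS topic "
                "(name TEXT PRIMARY KEY, mtime REAL, text TEXT)")
    cached = dict(con.execute("SELECT name, mtime FROM topic"))
    for fname in os.listdir(data_dir):
        if not fname.endswith(".txt"):
            continue
        path = os.path.join(data_dir, fname)
        mtime = os.path.getmtime(path)
        name = fname[:-4]
        if cached.get(name) != mtime:  # new or changed on disk
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            con.execute("INSERT OR REPLACE INTO topic VALUES (?, ?, ?)",
                        (name, mtime, text))
        cached.pop(name, None)
    for stale in cached:  # topic files deleted from disk
        con.execute("DELETE FROM topic WHERE name=?", (stale,))
    con.commit()
```

The key property is the last two comments: topics hacked, added, moved, or deleted directly on the file system are picked up on the next sync, so the cache never becomes the master.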
However, what would be totally sensible is to invest a limited amount of effort in the current
DBCacheContrib to see
what it feels like to have a real database store as a cache backend, while also circumventing its current scalability issues. They are
rooted (a) in the way the cache is kept in sync and (b) in the fact that each CGI process loads all of the cache into its own memory.
--
MichaelDaum - 18 Mar 2008
"Sven, the term 'DBCache' has not been used here to say 'let's move the Perl code of DBCacheContrib to TWiki's core'." Good - it read to me like Kenneth was using it in this sense.
I have already invested quite a bit of time in
DBCacheContrib, and to be honest, the problem seems to me to be architectural - as Crawford points out regularly, it was never designed for large topic sets.
I should clear up one thing:
DatabaseStore does this too, though in the opposite direction - it considers the database to be the canonical store (in the same way as the ,v files are today) and makes a working set of .txt files for what essentially amounts to legacy operations like regex SEARCH. So I guess that
DatabaseStore has learnt from
DBCacheContrib.
--
SvenDowideit - 18 Mar 2008
As Sven says, the TDB-based DBCache implementation is not good enough as it is. All it is useful for is as a proof of concept. This is the main reason I have never bothered to move it forward to the core query syntax - the idea of implementing the cache in-memory really doesn't scale. But it
proves the concept of a DB cache of the .txt files.
I don't think there is any future in the TDB implementation, but my points are:
- there is definitely a future in the concept of a scalable DB-based cache.
- such a cache can be implemented as an extension.
Kenneth, no, access controls are not cached by
DBCacheContrib. However,
WebDAVPlugin has a fully functional implementation of a database cache of access controls, complete with Perl and C-code interfaces and test cases.
To me, the way ahead is:
- a DB-based cache extension module, which supports the interfaces already in the code for query searching.
- a proper integration of a search engine such as Kino.
While I think the idea of a
DatabaseStore is ultimately the right approach, I think there is up-front work required in the core before the internal APIs are ready for it. A cache, on the other hand, doesn't have to wait.
--
CrawfordCurrie - 19 Mar 2008
I can see the arguments for and against. And frankly, I don't mind which way it goes as long as:
- Backward compatibility is maintained. The transition from 3.x to 4.0 killed my first attempt to champion TWiki. Many features in the upgrade were desirable, but we did not have the resources to get the plugins we had built to work. It undermined all the confidence we had in the product. We still have a userbase for TWiki 3.x, but its use is almost completely historical. New users see it as a dead end.
- Using a database in parallel with the file system is a high-risk strategy. Recovery from any issue is difficult, due to the synchronisation required between the file system and the database. If we go that way, the easy way out is to have a master that contains all data required for recovery, and a slave that is always created from the master on recovery. Only the master is backed up. Whether the master is the DB or the filesystem is irrelevant. But don't split the data between the two.
- The feature that sold us on TWiki is the versioning of the topics. Versioning must be implemented in the master, so that it is recoverable. I don't mind how it is done, but don't leave home without it!
Thanks for listening....
--
BramVanOosterhout - 23 Mar 2008
Bram. I agree with your view, and you will note that I also keep saying that I wish to keep the current .txt files, their format, and their RCS history as the authoritative master.
The purpose for the second storage is to enable much much faster searches (query and simple search) and much faster access rights management.
To me it is not a question of letting go of the .txt files. I think the survival of TWiki for its existing users depends on keeping this as a rock solid requirement.
In my view, a database storage will provide:
- Indexed representation of the topic content for simple searching
- Database table representation of access rights
- Database table representation of forms
- Database table representation of other meta data
- In the future, a database representation of elements of the content of the topics, following a topic object model. I would start with tables to enable fast query searches on data stored in topics as tables.
- Databases can at any time be deleted and rebuilt from the .txt files.
- Databases can be rebuilt at any time after manipulating the file system adding/moving/deleting/hacking the .txt files.
- The history of each topic is still only in the RCS files
- Regex-type searches still happen on the .txt files and will not gain any speed increase. Eventually, most TWiki applications that handle a large number of topics will migrate to using query-type searches, or searches tailored to query data stored in e.g. tables within the topics.
The purpose of using a database is not that it is smart, but that searching non-indexed flat files and parsing meta data is simply too slow and prevents a TWiki from growing to enterprise level.
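The access-rights item in the list above could look roughly like this. This is a deliberately simplified sketch: the table layout, the parsing of the ALLOWTOPICVIEW setting, and the permission logic are invented illustrations, not TWiki's real ACL rules (which also involve DENY settings, groups, and web-level defaults).

```python
import re
import sqlite3

# Simplified: only a single comma-separated ALLOWTOPICVIEW line is handled.
ALLOW_RE = re.compile(r"^\s+\* Set ALLOWTOPICVIEW = (.+)$", re.M)

def index_acl(con, topic, text):
    """Extract the view ACL from topic text into the cache at index time."""
    m = ALLOW_RE.search(text)
    allowed = [u.strip() for u in m.group(1).split(",")] if m else []
    con.execute("INSERT OR REPLACE INTO acl VALUES (?, ?)",
                (topic, ",".join(allowed)))

def may_view(con, topic, user):
    """Answer an access check from the cache, without opening any .txt file."""
    row = con.execute("SELECT allowed FROM acl WHERE topic=?",
                      (topic,)).fetchone()
    if row is None or row[0] == "":   # no ALLOW setting: open to all
        return True
    return user in row[0].split(",")

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE acl (topic TEXT PRIMARY KEY, allowed TEXT)")
index_acl(con, "SecretPlans", "   * Set ALLOWTOPICVIEW = KennethLavrsen\n")
index_acl(con, "WebHome", "Welcome to the web.\n")
print(may_view(con, "SecretPlans", "KennethLavrsen"))  # True
print(may_view(con, "SecretPlans", "MartinSeibert"))   # False
print(may_view(con, "WebHome", "MartinSeibert"))       # True
```

The point is that the ALLOWTOPICXXXX scan happens once per topic at index time rather than once per topic per search, which is exactly the 100000-file problem raised earlier in this topic.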
--
KennethLavrsen - 23 Mar 2008
Kenneth et al: Thank you very much for these deep thoughts on the future of TWiki. I like Kenneth's professional approach and have one point to add: the new surface of TWiki 5.0 should be free of all non-negotiable usability problems, to make it look much cleaner and be easier to use for inexperienced users. I will create a list of usability bugs from my perspective soon. Just to give you a taste: "In the upper right there are two search prompts, called 'jump' and 'search'. No user can intuitively distinguish between them. Additionally, both lack a submit button, which is essential for most users." I will come up with more ...
--
MartinSeibert - 24 Mar 2008