
Feature Proposals » TWiki Store with MongoDB

Summary

Current State: Developer: Reason: Date: Concerns By: Bug Tracking: Proposed For:
UnderInvestigation   None       KampalaRelease

Motivation

A large TWiki site can become slow because queries are run in real time against the file system. A NoSQL database such as MongoDB enables fast queries, as well as data replication across data centers.

Description and Documentation

MongoDB is a NoSQL database that stores JSON objects, so-called documents, in collections. A collection corresponds to a table in an RDBMS, and a document corresponds to a table record (without the rigid schema).

The idea is to create a store with query capability to store TWiki topic data and possibly attachments. There are several options:

Option 1: Use MongoDB as a cache for fast queries

The idea is to define a TopicObjectModel and store key data of topics as JSON objects in MongoDB, one JSON object per TWiki topic. Since MongoDB is a schema-less database, it is easy to store all TWiki meta data, such as topic info, form data, and attachment meta data. In addition, some topic content can be stored as well, such as WikiWords (for backlinks), headings, and a summary.

The "system of record" is still the TWiki topic .txt file; the MongoDB content is simply a cache that is updated on each topic save.

Example JSON object of this topic:

{
  "topicinfo":{
    "name":"TWikiStoreWithMongoDB",
    "title":"TWiki Store with MongoDB",
    "author":"PeterThoeny",
    "date":"1450399906",
    "format":"1.1",
    "version":"2"
  },
  "content":{
    "summary":"Motivation. A large TWiki site can become slow due to real-time queries on a file system.",
    "wikiwords":[
      "MongoDB",
      "NoSQL",
      "TopicObjectModel",
      "WikiWord"
    ],
    "h1":[
      "TWiki Store with MongoDB"
    ],
    "h2":[
      "Motivation",
      "Description and Documentation",
      "Impact",
      "Implementation",
      "Discussion"
    ]
  },
  "form":{
    "name":"ChangeProposalForm",
    "fields":[
      {
        "name":"TopicClassification",
        "title":"TopicClassification",
        "value":"FeatureRequest"
      },
      {
        "name":"TopicSummary",
        "title":"TopicSummary",
        "value":"TWiki Store with !MongoDB"
      }
    ]
  },
  "attachments":[
    {
      "name":"Sample.txt",
      "attachment":"Sample.txt",
      "attr":"",
      "comment":"test",
      "date":"1176431954",
      "path":"Sample.txt",
      "size":"35",
      "user":"PaulWise",
      "version":"1"
    },
    {
      "name":"Smile.gif",
      "attachment":"Smile.gif",
      "attr":"",
      "comment":"Smiley face",
      "date":"928821124",
      "path":"Smile.gif",
      "size":"94",
      "user":"PeterThoeny",
      "version":"1"
    }
  ]
}

Tip: This option is likely the easiest to implement. It is also the most compatible; for example, regex-based queries continue to work without changes.
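As a rough illustration of how such a cache object could be derived on topic save, here is a minimal Python sketch. The function name `build_cache_doc` and both regular expressions are assumptions made for illustration; in particular, the WikiWord pattern is a simplification of TWiki's real rules. The sketch only shows the shape of the data and does not talk to MongoDB.

```python
import re

# Hypothetical, simplified patterns (assumptions, not TWiki's exact rules).
# Headings look like "---+ Title", "---++ Title", ...; the number of "+" is the level.
HEADING_RE = re.compile(r"^---(\++)!*[ \t]*(.+?)[ \t]*$", re.MULTILINE)
# A WikiWord: uppercase letter(s), lowercase letter(s), uppercase letter(s), then anything alphanumeric.
WIKIWORD_RE = re.compile(r"\b[A-Z]+[a-z]+[A-Z]+[A-Za-z0-9]*\b")


def build_cache_doc(name, text, author, date, version):
    """Build one cache object for a topic, following the layout of the example above."""
    content = {"wikiwords": sorted(set(WIKIWORD_RE.findall(text)))}
    for plusses, title in HEADING_RE.findall(text):
        # Collect headings per level under the keys "h1", "h2", ...
        content.setdefault("h%d" % len(plusses), []).append(title)
    return {
        "topicinfo": {
            "name": name,
            "author": author,
            "date": str(date),
            "version": str(version),
        },
        "content": content,
    }
```

On each topic save, the resulting dictionary would replace the previous cache object for that topic, while the .txt file remains the system of record.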

Option 2: Use MongoDB as the actual store for topics, but not attachments

With this option the TWiki topic .txt file is no longer needed; all topic content is stored in MongoDB. That means versioning also needs to be done in MongoDB. For simplicity, each version is one document.

The JSON object looks very similar to the first option, with these two additions:

  • It contains a flag that indicates whether the document is the top revision or not (for fast queries, this field should be indexed)
  • It also contains the raw topic content

Example JSON object of this topic, with the two additions ("istop" and "raw"):

{
  "topicinfo":{
    "name":"TWikiStoreWithMongoDB",
    "title":"TWiki Store with MongoDB",
    "author":"PeterThoeny",
    "date":"1450399906",
    "format":"1.1",
    "version":"2",
    "istop":1
  },
  "content":{
    "summary":"Motivation. A large TWiki site can become slow...",
    "raw":"%TOC%\\n---++ Motivation\\nA large TWiki site can become slow...",
    "wikiwords":[
...
}

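Tip: This option is likely more time-consuming to implement. Regex-based queries could be supported by storing the raw TWiki topic content verbatim in each document, in addition to the topic object model. (BSON is the binary format in which MongoDB stores documents; regex matching works on string fields of those documents.)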
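The one-document-per-version scheme with a top-revision flag can be sketched as follows. This is only an illustration under stated assumptions: a plain Python list stands in for a MongoDB collection, and the function names `save_revision` and `top_revisions` are hypothetical. With a real driver, the lookup of top revisions would presumably be a filter such as `{"topicinfo.istop": 1}` on an indexed field.

```python
def save_revision(collection, doc):
    """Store doc as the new top revision of its topic.

    `collection` is a plain list standing in for a MongoDB collection (assumption).
    """
    name = doc["topicinfo"]["name"]
    latest = 0
    for old in collection:
        info = old["topicinfo"]
        if info["name"] == name:
            info["istop"] = 0  # demote every earlier revision of this topic
            latest = max(latest, int(info["version"]))
    new_doc = dict(doc)
    # Versions are kept as strings, as in the example JSON object.
    new_doc["topicinfo"] = dict(doc["topicinfo"], version=str(latest + 1), istop=1)
    collection.append(new_doc)
    return new_doc


def top_revisions(collection):
    """Return only the current revision of every topic."""
    return [d for d in collection if d["topicinfo"].get("istop") == 1]
```

Older revisions stay in the collection as full copies, which is the storage trade-off raised in the discussion below.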

Option 3: Use MongoDB as the actual store for topics and attachments

TWiki topic content and attachments are stored as blobs in MongoDB. TWiki topic content is also stored as a topic object model for fast queries.

Because all content resides in MongoDB, we can take advantage of replication for high availability and for redundancy across multiple data centers using distributed replica sets.

Tip: This option is likely the most time-consuming to implement.

Impact

WhatDoesItAffect: Performance

Implementation

To be designed and implemented. Peter can't commit at this time. Any takers?

Likely requires a TWiki::Store::MongoDB store module and a TWiki::Store::SearchAlgorithms::MongoDB query module.
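To give an idea of what the query module would have to produce, here is a small Python sketch that builds MongoDB filter documents for two common cases: a form field lookup and a regex search on raw topic text. The function names are hypothetical, and the field paths ("form.fields", "content.raw", "topicinfo.istop") assume the example JSON layout from this proposal. `$elemMatch` and `$regex` are standard MongoDB query operators; whether TWiki's regex syntax maps one-to-one onto MongoDB's would still need to be checked.

```python
def form_field_filter(field, value, top_only=False):
    """Filter for topics whose form has `field` set to `value`.

    Set top_only=True when every version is its own document (option 2).
    """
    query = {"form.fields": {"$elemMatch": {"name": field, "value": value}}}
    if top_only:
        query["topicinfo.istop"] = 1
    return query


def regex_filter(pattern, top_only=False):
    """Filter for topics whose raw text matches `pattern` (needs a stored "raw" field)."""
    query = {"content.raw": {"$regex": pattern}}
    if top_only:
        query["topicinfo.istop"] = 1
    return query
```

For example, `form_field_filter("TopicClassification", "FeatureRequest")` describes all topics classified as feature requests; the resulting dictionary is what a driver's find call would take as its filter argument.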

-- Contributors: Peter Thoeny - 2015-12-18

Discussion

We discussed this at the KampalaReleaseMeeting2015x12x17; Peter is doing a Node.js and MongoDB-based project for a client, which was the basis for this discussion.

[3:32pm] PeterThoeny: my first project is not twiki related
[3:32pm] PeterThoeny: i lead a midsize internal tool project
[3:32pm] PeterThoeny: we use mongodb as the database
[3:32pm] PeterThoeny: and node.js for the app logic on the server
[3:33pm] PeterThoeny: browser side is jquery, ajax and such
[3:33pm] PeterThoeny: i really like this dev env
[3:33pm] PeterThoeny: node.js with express webserver is very easy to use
[3:33pm] PeterThoeny: and powerful
[3:33pm] HaraldJoerg: Yes, that's pretty agile
[3:35pm] PeterThoeny: there is mongoose to tie data structures directly to docs in mongodb
[3:35pm] PeterThoeny: really cool and easy
[3:35pm] PeterThoeny: say, you have a collection with users in mongodb
[3:36pm] PeterThoeny: you create a javascript object of any hierarchy (depth), and tie it to mongodb
[3:36pm] PeterThoeny: that's it
[3:36pm] HaraldJoerg: Yes, for some practical problems this non-SQL stuff is a very clean solution
[3:36pm] PeterThoeny: now you simply create new user docs and change user docs simply by manipulating javascript objects
[3:37pm] PeterThoeny: which brings me to twiki:
[3:37pm] PeterThoeny: i think mongodb would be an ideal database backend for twiki
[3:37pm] HaraldJoerg: Object persistence through RDBMS is always a bit convoluted compared to mongodb
[3:37pm] PeterThoeny: because mongodb is a schemaless db
[3:38pm] HaraldJoerg: I guess this would need work on a TWiki topic object model
[3:38pm] PeterThoeny: this allows us to simply store a twiki topic as a mongodb doc, and include the twiki form data, and other page structure in a queryable way
[3:39pm] PeterThoeny: yes and no
[3:40pm] HaraldJoerg: And with all database-oriented backends, I'm concerned about efficient version management
[3:41pm] PeterThoeny: no: we could store the twiki topic text sans meta data as a blob, and all meta data as objects. in addition, we can parse page data and store some of that as objects, such as headings, wikiwords (for backlinks), etc.
[3:42pm] PeterThoeny: yes: full twiki object model - that could mean more strict structure for content in a twiki page
[3:42pm] PeterThoeny: so i think the former is more flexible
[3:43pm] PeterThoeny: with the former, we could simply keep the current file-based store and versioning, and use the mongodb doc as a fast cache
[3:43pm] PeterThoeny: e.g. master of record is still the twiki topic text
[3:44pm] PeterThoeny: another approach would be also the former (e.g. no: ), but with mongodb doc as the actual store
[3:45pm] PeterThoeny: in that case we need to create a mongodb doc for each version, and keep track of the top version for fast queries
[3:46pm] HaraldJoerg: Yes, that emphasises the point about efficiency of versioning
[3:46pm] PeterThoeny: the mongodb as cache is faster to implement than the mongodb as store
[3:47pm] PeterThoeny: in mongodb you can create multiple indices, so with an indexed "top version" flag it would be quick to query all top revs
[3:48pm] PeterThoeny: one catch: mongodb keeps all indexes in memory, so be careful not to create indexes on too big datasets
[3:48pm] HaraldJoerg: Sure.  But every version is a full copy in the database
[3:49pm] PeterThoeny: i am not concerned about data size on disk
[3:49pm] HaraldJoerg: RCS may be old, but is efficient for topics with hundreds of versions of small changes
[3:49pm] PeterThoeny: a typical twiki installation has 10:1 or bigger ratio of attachments : topic text
[3:50pm] PeterThoeny: so, as long as attachments are stored in the file system we are fine
[3:50pm] PeterThoeny: actually, attachments need to be considered as well
[3:50pm] HaraldJoerg: Almost agreed: For attachments, I seldom see more than one version, for topics, we have several which are continuously updated
[3:51pm] PeterThoeny: in fact versioning attachments is the biggest thing, space-wise
[3:51pm] HaraldJoerg: Yes, because they're binary, RCS isn't nearly as efficient as with text
[3:52pm] HaraldJoerg: For most attachments in "my" productive TWiki, versioning is not required at all
[3:52pm] PeterThoeny: so in this case, storing attachments as blobs in mongodb would be feasible too
[3:53pm] PeterThoeny: one important detail is twiki's query language
[3:53pm] PeterThoeny: that needs to be mapped to mongodb queries
[3:54pm] PeterThoeny: twiki's regex search will not be possible unless we store the whole topic text verbatim as a blob in mongodb, or keep the topic text in the file system
[3:54pm] PeterThoeny: so there are details to be worked out
[3:55pm] PeterThoeny: one advantage of mongodb is that you can create replicated databases
[3:55pm] PeterThoeny: https://docs.mongodb.org/v3.0/core/replica-set-architecture-geographically-distributed/
[3:56pm] HaraldJoerg: File systems can do that, too 
[3:56pm] PeterThoeny: so that could be interesting for morgan stanley and other companies that have twiki installed in multiple datacenters
[3:58pm] PeterThoeny: yes, possible on file system, but with manual plumbing
[3:59pm] PeterThoeny: let me write up a proposal without committing to it (lack of time)
[4:01pm] PeterThoeny: a more technical description on replication at https://docs.mongodb.org/v3.0/tutorial/deploy-geographically-distributed-replica-set/

-- Peter Thoeny - 2015-12-17

How to install MongoDB on a CentOS or RedHat server:

1. Prepare yum repository:

Create the file /etc/yum.repos.d/mongodb.repo (for example with vi) and add this content for a 32-bit server:

[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/i686/
gpgcheck=0
enabled=1

The baseurl for a 64-bit server is: http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/

2. Install MongoDB:

yum install mongodb-org

This package is a metapackage that will automatically install these component packages:

  • mongodb-org-server - contains the mongod daemon and associated configuration and init scripts
  • mongodb-org-mongos - contains the mongos daemon
  • mongodb-org-shell - contains the mongo shell
  • mongodb-org-tools - contains the tools: mongoimport, bsondump, mongodump, mongoexport, mongofiles, mongooplog, mongoperf, mongorestore, mongostat, and mongotop

3. Tweak configuration:

Review /etc/mongod.conf

Either disable SELinux, or configure it to allow port 27017:

semanage port -a -t mongod_port_t -p tcp 27017

4. Start the mongod daemon:

/sbin/service mongod start

Check the log file /var/log/mongodb/mongod.log for: [initandlisten] waiting for connections on port 27017

Details at https://docs.mongodb.org/v3.0/tutorial/install-mongodb-on-red-hat/
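The log check in step 4 could be scripted along these lines. This is a minimal sketch; the function name `mongod_is_listening` is hypothetical, and the message text is taken from the line quoted above, so it may differ in other MongoDB versions.

```python
def mongod_is_listening(log_text, port=27017):
    """Return True if the log contains the 'waiting for connections' line for `port`."""
    needle = "waiting for connections on port %d" % port
    return any(needle in line for line in log_text.splitlines())


def check_log_file(path="/var/log/mongodb/mongod.log", port=27017):
    """Read the mongod log file and report whether the daemon came up."""
    with open(path, "r") as handle:
        return mongod_is_listening(handle.read(), port)
```

Calling `check_log_file()` on the server after starting the daemon would return True once the startup line has been written.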

-- Peter Thoeny - 2015-12-18

This was discussed at KampalaReleaseMeeting2016x01x07:

  • No committed developer and date at this time
  • Using MongoDB as a cache can potentially speed up things dramatically
  • Advantages of NoSQL over SQL in TWiki's case:
    • schema-less; allows us to cache arbitrary twiki forms with fast queries
    • regex queries, a must if we want to stay compatible with twiki query language
    • think in objects, not rows and columns; e.g., easier to implement because of the object nature of twiki topics

-- Peter Thoeny - 2016-01-08

Topic revision: r3 - 2016-01-08 - PeterThoeny
Copyright © 1999-2017 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.