
Re: Wikipedia software was Re: [kDev] RFC: Kendra Tools Project Plan 1...



Daniel Harris wrote:

Hi Neil and All,

Thanks for your email. Please see comments and questions:

Date: Mon, 27 Jan 2003 12:22:45 +0000
From: Neil Harris

I'd start off by looking at the Phase III Wikipedia software.


I've taken a look at the site. It does have some good stuff there.

Do you advocate merging with their information-gathering effort, just using their software, or participating in the development of the software (as a separate branch, or within their branch of the project)? Should Kendra Foundation pay someone to develop Wikipedia into kendraTools?

I think Kendra is trying to solve a different problem from Wikipedia, which has the "narrow" aim of writing a multilingual encyclopedia on all subjects. However, it does not hurt that their content is available under the GFDL, should we need to use it. See also Wiktionary, which is an early-stage project to use the same technology for a dictionary.


It meets several of the requirements already, namely:

* International: already supports a large number of languages,
extensible to any Unicode-supported language.
* Logging: full history for all pages, support for admin tools


Can this only support pages/articles? I mean, can it support data like lists of kendraParticipants, their addresses, their servers, their songs, their songs' business rules, which songs are on which servers, etc.? And can we easily extend the data model so that new objects can be bolted on?

For example, we want record labels to share their song metadata for a global catalogue. Traditionally, before we allow them to input anything, we need them to tell us what their data structure is. But hold on: does that mean we need to talk to every record label to build a global data structure? We can't do it that way; it's not practical. So we need to enable any new record label coming into the project to describe its own data structure and how it relates to what's currently there. Yes? How do we do that?

Well, "pages" can be anything, and they can be linked in many different ways: the free-form Wiki structure can work alongside table-driven data structures, or the Wiki data can be treated as raw material for data mining into a more rigid form. Wikipedia articles tend to follow style rules: by using conventions in free-form material, they can be parsed to build up relational indices, which can then drive auto-generated content, web services, etc.
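To make the "conventions in free-form material" idea concrete, here is a minimal Python sketch. It is not how the Wikipedia software works: the "Field: value" line convention, the page titles, and the function names are all assumptions made up for illustration. Only the [[double-bracket]] link syntax is borrowed from Wiki markup.

```python
import re
from collections import defaultdict

# Assumed convention (hypothetical): "Field: value" lines carry metadata,
# and [[Target]] marks a link to another page.
FIELD_RE = re.compile(r"^(\w[\w ]*):\s*(.+)$")
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def parse_page(text):
    """Return (fields, links) extracted from one free-form page."""
    fields, links = {}, []
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group(1).strip().lower()] = match.group(2).strip()
        links.extend(LINK_RE.findall(line))
    return fields, links

def build_index(pages):
    """Map (field, value) pairs to the set of page titles that carry them."""
    index = defaultdict(set)
    for title, text in pages.items():
        fields, _ = parse_page(text)
        for key, value in fields.items():
            index[(key, value)].add(title)
    return index

# Made-up sample pages; the rest of each page stays free-form text.
pages = {
    "Song A": "Label: Example Records\nServer: [[Server One]]\nFree text here.",
    "Song B": "Label: Example Records\nSome notes about the song.",
}
index = build_index(pages)
print(sorted(index[("label", "Example Records")]))
```

A new record label could then add whatever fields it likes, and the index simply grows new keys, with no global schema agreed in advance.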


* Syndication: RSS feeds already running.
* Comment feedback: yes
* Per-user namespaces: yes

Other advantages:
* highly customizable


All cool.

* existing developer and 1000+ user base
* current database exceeds 100,000 articles with full histories


Above 2 points only relevant if we merge, yes?

The point here was: it scales well (don't look today, though, they're debugging new database support -- see below -- and it's slow at the moment).


* written in PHP
* bug database on SourceForge
* support for image download and TeX for formulas
* tested under high loads (300,000 hits/day +)
* supports interactive content updates (2000+ edits/day)
* currently runs on a single 2-cpu server running Linux


All cool.

* uses clean client-server design, so scalable to multiple page-servers
sharing a database


We need totally distributed databases. Each company/organisation/group/individual may want to host their own data and we need to be able to cope with that.

Inter-wiki links should be able to handle that.
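As a rough sketch of what that could look like: each participant hosts its own wiki, and a link of the form "prefix:Page Title" is resolved through a prefix table. The prefixes, hostnames, and the `resolve` function below are hypothetical, not taken from the Wikipedia code.

```python
from urllib.parse import quote

# Hypothetical prefix table: one entry per independently hosted wiki.
INTERWIKI = {
    "labelone": "http://wiki.labelone.example/wiki/",
    "kendra": "http://wiki.kendra.example/wiki/",
}

def resolve(link, local_base="http://localhost/wiki/"):
    """Turn 'prefix:Page Title' into a URL on the wiki that hosts the page."""
    prefix, sep, title = link.partition(":")
    base = INTERWIKI.get(prefix.lower()) if sep else None
    if base is None:
        # No known prefix: treat the whole link as a local page title.
        base, title = local_base, link
    return base + quote(title.strip().replace(" ", "_"), safe="/:")

print(resolve("labelone:Song A"))
print(resolve("Song B"))
```

Adding a new organisation then means adding one row to the table; the data itself stays on that organisation's server.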


* provision for page caching
* GPL licence


If we just use what they give us, then fine, but if we want to develop the code under our own branch then we'll have to use the GPL. The GPL is great for what it wants to do. However, Kendra's aim is to get kendraSystem created, and the best way I can see of doing that is to release the code as public domain, or as close to that as we can legally get. Then EVERYONE can use the code in ANYTHING they produce, and if they can make money from it then so much the better. Yes?

Anyone can use GPL'd code to make money. Even _Microsoft_ ships GPL'd code for their Unix tools for Windows (yes, really).

As I understand it, only code derived from, or directly linked to, GPL'd code is required to be GPL'd. Using a GPL'd engine does not require the use of the GPL on subsystems built to run on it: for example, Linux can be used to run proprietary software, without any GPL restrictions.

Finally, the current Wikipedia software is the work of only a few people, and they _might_ be willing to let it be relicensed under, say, the BSD licence if we ask them nicely, and give them a really good explanation of why we need it.


Drawbacks:
* currently runs on MySQL, although a port to PostgreSQL has been discussed


What's wrong with MySQL?

Lack of proper transaction support (and yes, they know about InnoDB, and are transitioning to it as I type).



* performance is currently limited by write-locking, although a
shift to PostgreSQL should fix this


What's wrong with write-locking? Does it freeze the whole database momentarily?

Yes. Think of having a queue of read-only database operations, with a write operation somewhere in the middle. The write operation must lock a load of tables, to avoid inconsistency. Before it can do that, all read ops ahead of it using the same tables must first complete. This in turn blocks all the read ops after it in the queue: so things are serialized, and the system cannot make progress until the locks are released.
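A toy model of that queue, in Python, may help. It assumes all operations are queued at time 0 on the same tables, that reads can run concurrently with each other, and that a write needs an exclusive lock. The function name and the durations are invented for illustration; this is not a model of MySQL's actual scheduler.

```python
def finish_times(ops):
    """ops: list of (kind, duration) in arrival order, all queued at t=0.

    Returns the finish time of each operation under a simple table-lock
    model: reads share the lock, a write holds it exclusively.
    """
    finishes = []
    barrier = 0.0      # time at which the most recent write released its lock
    reads_done = 0.0   # latest finish time among reads since that write
    for kind, duration in ops:
        if kind == "read":
            # Reads may start as soon as the previous write has finished.
            end = barrier + duration
            reads_done = max(reads_done, end)
        else:
            # A write waits for every earlier read, then blocks everything.
            start = max(barrier, reads_done)
            end = start + duration
            barrier = reads_done = end
        finishes.append(end)
    return finishes

ops = [("read", 3), ("read", 1), ("write", 2), ("read", 1), ("read", 1)]
print(finish_times(ops))
```

In this example the two reads behind the write finish at t=6, although each needs only one time unit: they wait for the write, which in turn waits for the slow read ahead of it.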

Smart databases can get around this, by using clever programming and careful interpretation of the ACID properties of database semantics. MySQL is too dumb to do this. This does not matter in a read-mostly environment without a need for proper transactions (where MySQL is great!), but where you are doing thousands of updates per day on a site with lots of global data dependencies, like a Wiki, it's a big performance hit.

PostgreSQL is much better in systems with a high update level: it uses multi-versioning instead of locks wherever possible.
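The idea behind multi-versioning can be shown with a small in-memory sketch: a writer appends a new version instead of overwriting, and a reader sees the newest version at or before the snapshot it took, so neither blocks the other. The `VersionedStore` class is a made-up illustration of the concept, not PostgreSQL's implementation.

```python
class VersionedStore:
    """Toy multi-version store: writes append, reads use a snapshot."""

    def __init__(self):
        self._versions = {}   # key -> list of (commit_version, value)
        self._current = 0     # version number of the latest commit

    def snapshot(self):
        """Return the version a reader should use for a consistent view."""
        return self._current

    def write(self, key, value):
        """Commit a new version of key without touching older versions."""
        self._current += 1
        self._versions.setdefault(key, []).append((self._current, value))
        return self._current

    def read(self, key, snapshot):
        """Return the newest value committed at or before the snapshot."""
        for version, value in reversed(self._versions.get(key, [])):
            if version <= snapshot:
                return value
        return None

store = VersionedStore()
store.write("page", "first draft")
reader_view = store.snapshot()          # a long-running read starts here
store.write("page", "second draft")     # the edit does not wait for it
print(store.read("page", reader_view))
print(store.read("page", store.snapshot()))
```

The price is that old versions accumulate and have to be cleaned up at some point, which this sketch ignores.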

Regards,

Neil