Baloo/XapianProblems

From KDE Community Wiki


  • Licensing - Xapian is released under the GPL license. Baloo wants to be a framework and be released under LGPL. Earlier versions of Baloo used Xapian via a plugin infrastructure. This is no longer the case. It can therefore be argued that Baloo is also GPL.
  • Concurrent Access - Xapian is very bad at concurrent access.
    • One global write lock
    • This write lock is obtained via creating a Xapian::WritableDatabase class. This means if a process wants to modify an existing document, even while they are reading the document and performing the modifications in memory, no other process can write to the database. Baloo therefore often keeps a readonly connection to Xapian and when it wants to do a write, opens a WritableDatabase, writes and then closes it. This is however expensive as the entire document needs to be read from the read-only db, and then written from the write-db. Also, internal database caches are never shared.
    • If anyone does any writes to the Xapian Database, processes which are reading the data can potentially encounter a Xapian::DatabaseModfiied exception. This requires the client to re-open the database and redo whatever they were doing. It results in ugly code in Baloo where all Xapian usage is enclosed in a `while (1) { try { .. } catch {} }` block. Also in many cases restarting what you're doing is too expensive.
    • If there are many writes occurring then processes reading will keep encountering this DatabaseModified exception, if they are trying to read the same chunk.
  • Exception Reliant It heavily relies on exceptions. Exceptions are not well supported in Qt and might make the application crash as mentioned here. For example while locking a database Xapian expects the program to catch certain exceptions and retry if they are caught.
    • Example - You cannot check if a document exists in a database, you need to try and fetch it. If it throws an exception, then it does not exist.
  • It does not handle data that is changing frequently, if data in document changes to frequently it can lead to a conditions in which locking the database for writing becomes impossible thus making baloo fail.
  • No Text normalization support - Baloo needs support for normalizing text i.e. removing all diacritic marks and also needs to split words with '_' to generate terms, Xapian's term generator doesn't provide support for either. So baloo uses its own term generator.
  • No Proper Support for startsWith - Xapian does not support searching for words starting with 'x'. The way to do this is to request every term start with 'x', and then do an OR search for each of these words. Eg - x OR xorg OR xdan ... This results in excessive data-reads which aren't really required. Also, quite often this list can get huge. We have special techniques on the Baloo side for dealing with this. But it is not cheap.
  • No Support for Comparison Operators - Xapian only focuses on looking up terms. If one wants any kind of comparison operators on any term, one needs to iterate over each of those terms and check it manually. This makes doing queries such as "rating < 4" potentially expensive. It's even more expensive to do date queries. 'modified < 2014-12-05'. Ideally, Xapian could store extra btrees for each value and allow efficient searching on them.
  • Modifications are expensive - Due to Xapian's internal btree structure making modification on documents is very expensive. It results in the entire document being loaded into memory and then being written back. This causes high disk i/o, specially for files with large quantities of text.
  • Stores each word on disk thrice - Xapian's chert backend stores each word thrice. This can easily be reduced to two.
    • Posting List - Table mapping word -> document-id.
    • Term List - document-id -> all the words
    • Position List - (document-id + word) -> positions

It would make more sense to combine the TermList and PositionList table.

  • Loads entire documents in memory - If one needs to modify a document, Xapian will loads all the terms + positions in memory first. This can get very expensive. It would be better if documents were stored in a way such that only part of them could be loaded and modified.
  • Locking the Database is hard If one cannot lock the db, we just get an exception thrown. We therefore need to implement our own exponential backing algorithm to lock the database.