Baloo/XapianProblems: Difference between revisions

From KDE Community Wiki
m (fix typo)
(Added concurrent access problems + value problems)

Revision as of 15:42, 6 December 2014

Problems with Xapian

  • Licensing - Xapian is released under the GPL license. Baloo wants to be a framework and be released under LGPL. Earlier versions of Baloo used Xapian via a plugin infrastructure. This is no longer the case. It can therefore be argued that Baloo is also GPL.
  • Concurrent Access - Xapian handles concurrent access very poorly.
    • There is a single global write lock.
    • This write lock is acquired by constructing a Xapian::WritableDatabase object. This means that if a process wants to modify an existing document, no other process can write to the database, even while the first process is only reading the document and performing the modifications in memory. Baloo therefore usually keeps a read-only connection to Xapian and, when it wants to write, opens a WritableDatabase, writes, and then closes it. This is expensive, however, because the entire document has to be read from the read-only db and then written through the write db. Also, internal database caches are never shared.
    • If anyone writes to the Xapian database, processes which are reading the data can encounter a Xapian::DatabaseModifiedError exception. This requires the client to re-open the database and redo whatever it was doing. It results in ugly code in Baloo where all Xapian usage is enclosed in a `while (1) { try { .. } catch {} }` block. Also, in many cases restarting what you were doing is too expensive.
    • If many writes are occurring, reading processes will keep encountering this DatabaseModifiedError exception as long as they are trying to read the same chunk.
  • Xapian relies heavily on exceptions. Exceptions are not well supported in Qt and might make the application crash, as described in the Qt exception safety documentation (http://qt-project.org/doc/qt-5/exceptionsafety.html). For example, while locking a database Xapian expects the program to catch certain exceptions and retry when they occur.
    • Example - You cannot check whether a document exists in a database; you have to try to fetch it. If the fetch throws an exception, the document does not exist.
  • It does not handle frequently changing data well. If the data in a document changes too frequently, it can lead to a condition in which locking the database for writing becomes impossible, making Baloo fail.
  • Baloo needs text normalization, i.e. removal of all diacritic marks, and it also needs to split words on '_' to generate terms. Xapian's term generator supports neither, so Baloo uses its own term generator.
  • Xapian does not directly support prefix searches, e.g. searching for all words starting with 'x'. The way to do this is to request every term starting with 'x' and then do an OR search over all of these words, e.g. x OR xorg OR xdan ... This results in excessive data reads which aren't really required. Quite often this list also gets huge. We have special techniques on the Baloo side for dealing with this, but they are not cheap.
  • Xapian focuses only on looking up terms. If one wants any kind of comparison operator on a term, one needs to iterate over each of those terms and check it manually. This makes queries such as "rating < 4" potentially expensive. Date queries such as 'modified < 2014-12-05' are even more expensive. Ideally, Xapian could store an extra btree for each value and allow efficient searching on it.
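
The retry pattern described under Concurrent Access can be sketched roughly as follows. This is only an illustration using the standard library: `DatabaseModifiedError` is a local stand-in for the Xapian exception of the same name, and `withRetry` and `FlakyReader` are hypothetical names, not Baloo or Xapian API. The attempt limit is also an assumption; the `while (1)` loop mentioned above has no such bound.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Stand-in for Xapian::DatabaseModifiedError (assumption: defined locally so
// the sketch compiles without Xapian installed).
struct DatabaseModifiedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical helper: runs `work`; whenever the read is invalidated by a
// concurrent write, calls `reopen` and restarts the whole operation.
// `maxAttempts` bounds the loop so a busy writer cannot stall a reader forever.
template <typename Result>
Result withRetry(const std::function<Result()>& work,
                 const std::function<void()>& reopen,
                 int maxAttempts = 5)
{
    for (int attempt = 1; ; ++attempt) {
        try {
            return work();
        } catch (const DatabaseModifiedError&) {
            if (attempt >= maxAttempts) {
                throw; // give up and let the caller decide what to do
            }
            reopen(); // with Xapian this would be a call such as Database::reopen()
        }
    }
}

// Hypothetical reader that fails a fixed number of times before succeeding,
// simulating writes that land while a read is in progress.
struct FlakyReader {
    int failuresLeft;
    int reopenCount;

    explicit FlakyReader(int failures) : failuresLeft(failures), reopenCount(0) {}

    std::string read() {
        if (failuresLeft > 0) {
            --failuresLeft;
            throw DatabaseModifiedError("database modified during read");
        }
        return "document data";
    }
    void reopen() { ++reopenCount; }
};
```

Note that every retry repeats the complete read, which is why restarting can be too expensive for long-running operations.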
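
The prefix-search workaround from the list above (collect every term starting with 'x', then OR them together) might look like the sketch below. A sorted `std::set` stands in for the database's term dictionary, and `expandPrefix` and `buildOrQuery` are hypothetical helpers rather than Baloo's actual code. With Xapian itself, the term list would presumably come from iterating `Database::allterms_begin(prefix)`.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: collect every indexed term that starts with `prefix`.
std::vector<std::string> expandPrefix(const std::set<std::string>& terms,
                                      const std::string& prefix)
{
    std::vector<std::string> matches;
    // lower_bound jumps to the first term >= prefix; because the set is
    // sorted, all terms sharing the prefix follow contiguously.
    for (auto it = terms.lower_bound(prefix); it != terms.end(); ++it) {
        if (it->compare(0, prefix.size(), prefix) != 0) {
            break; // past the last term sharing the prefix
        }
        matches.push_back(*it);
    }
    return matches;
}

// Hypothetical helper: join the expanded terms into a single OR query string.
std::string buildOrQuery(const std::vector<std::string>& expanded)
{
    std::string query;
    for (std::size_t i = 0; i < expanded.size(); ++i) {
        if (i > 0) {
            query += " OR ";
        }
        query += expanded[i];
    }
    return query;
}
```

The size of the resulting query grows with the number of matching terms, which is the cost the list item describes: a short prefix over a large index can expand to a very long OR chain.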
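
To illustrate the last point, the sketch below contrasts a manual scan for "rating < 4" with a lookup in an ordered index. It does not use Xapian at all: `Doc`, `ratingBelowScan` and `ratingBelowIndexed` are hypothetical, and `std::multimap` (a balanced tree) merely stands in for the per-value btree that the text wishes Xapian provided.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical document record with a single numeric value.
struct Doc {
    int id;
    int rating;
};

// Manual approach: inspect the value of every document.
std::vector<int> ratingBelowScan(const std::vector<Doc>& docs, int limit)
{
    std::vector<int> ids;
    for (const Doc& d : docs) {
        if (d.rating < limit) {
            ids.push_back(d.id);
        }
    }
    return ids;
}

// Indexed approach: `index` maps rating -> document id in sorted order, so
// only entries below `limit` are visited. Results come back ordered by rating.
std::vector<int> ratingBelowIndexed(const std::multimap<int, int>& index, int limit)
{
    std::vector<int> ids;
    const auto end = index.lower_bound(limit); // first entry with rating >= limit
    for (auto it = index.begin(); it != end; ++it) {
        ids.push_back(it->second);
    }
    return ids;
}
```

The scan touches every document regardless of how many match, while the indexed lookup only walks the matching range. Dates could be handled the same way by keying the index on a timestamp.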