Baloo/XapianAlternatives

Latest revision as of 23:37, 23 August 2024

This page documents the options we could use if we decide to move away from Xapian.

Full Text Indexing Solutions

CLucene

The project appears to be unmaintained.

Lucene++

Sphinx

Sphinx is a standalone search engine providing size-efficient and relevant full-text search functions to other applications. Sphinx integrates with SQL databases and scripting languages. Data source drivers support fetching data either via a direct connection to MySQL or PostgreSQL, or from a pipe in a custom XML format. The search API is natively ported to PHP, Python, Perl, Ruby, and Java, and is also available as a pluggable MySQL storage engine. "Sphinx" is an acronym that is officially decoded as SQL Phrase Index.

https://sphinxsearch.com/

And now for something completely different

Another option would be to use NLP techniques. Their distinct advantages are the ability to capture context of use, to search for a concept and pick up its synonyms, and even to link acronyms and abbreviations probabilistically based on surrounding context. This takes more time than plain indexing, but the cost could be amortized by running background processes whenever documents are saved, for cases where scheduled off-hours indexing (e.g. set to run at 3 a.m. every day) isn't an option or where real-time search is desired.

It is the level of technology needed if you want to find documents that pertain to your dog, but don't want to see every email that says you were "dog tired" or "sick as a dog" (NLP can recognize idioms).
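As a very rough illustration of the "dog tired" problem, the sketch below masks known idiomatic phrases before checking whether a term is used literally. It is only a crude stand-in for real NLP: the idiom list and the function names (mask_idioms, mentions_literally) are hypothetical, and an actual implementation would rely on a trained model or a proper lexicon rather than a hard-coded list.

```python
import re

# Hypothetical idiom list; a real system would use an NLP model or lexicon.
IDIOMS = ["dog tired", "sick as a dog", "raining cats and dogs"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def mask_idioms(text, idioms=IDIOMS):
    """Blank out known idiomatic phrases so their words are not matched literally."""
    lowered = text.lower()
    for idiom in idioms:
        pattern = r"\b" + re.escape(idiom) + r"\b"
        lowered = re.sub(pattern, " ", lowered)
    return lowered


def mentions_literally(text, term):
    """Return True if `term` occurs somewhere outside of a known idiom."""
    return term in tokenize(mask_idioms(text))
```

With this sketch, an email saying "I was dog tired after the trip" would not match a search for "dog", while "I took my dog to the vet" would.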

Tools like WordNet could supply general synonym data, and we could also offer the option of using specialized vocabularies for engineers, scientists, physicians, lawyers and others who work in domains with complex and detailed specialized vocabulary, when those vocabularies are available.

For example, in the medical field, a vocabulary like SNOMED-CT has about 320K terms organized in an ontology-like hierarchy (there is an OWL representation, but it is not sufficiently specified to be considered a machine-processable ontology). Using such a controlled language offers additional query and indexing options based on concepts and context, not just stemming and tuple indexing.
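To make concept-based indexing more concrete, here is a minimal sketch that indexes documents by concept instead of by literal word, and expands a query to all narrower concepts in a hierarchy. The vocabulary and hierarchy are a made-up toy (they are not taken from SNOMED-CT or WordNet), and all names in the code are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical toy vocabulary (NOT taken from SNOMED-CT or WordNet):
# each surface term maps to a concept identifier, so synonyms share a concept.
TERM_TO_CONCEPT = {
    "dog": "dog",
    "puppy": "dog",
    "hound": "dog",
    "cat": "cat",
    "kitten": "cat",
    "pet": "pet",
}

# Hypothetical is-a hierarchy: child concept -> parent concept.
PARENT = {"dog": "pet", "cat": "pet", "pet": "animal"}


def concepts_in(text):
    """Return the set of concepts whose surface terms occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for term, concept in TERM_TO_CONCEPT.items() if term in words}


def descendants(concept):
    """Return the concept together with everything below it in the hierarchy."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def build_index(docs):
    """Map each concept to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept in concepts_in(text):
            index[concept].add(doc_id)
    return index


def search(index, concept):
    """Find documents mentioning the concept or any narrower concept."""
    hits = set()
    for narrower in descendants(concept):
        hits |= index.get(narrower, set())
    return hits
```

A search for the concept "pet" then finds a document that only mentions a "puppy", which a purely stemming-based index would miss.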

Why not Full Image Indexing?

Of course, with the capabilities in OpenCV and various ML libraries, we could just as easily do the same with photographs: pick out who is in them, list the objects, animals and plants (even if they can't be identified precisely), and, if there is GPS metadata, even record location and time frame. Perhaps the next generation of file parsers could integrate with KPhotoAlbum in this regard.

Key/Value Stores

Another option is to build our own full-text indexing solution. We don't want to do all the hard work of lower-level file access, concurrency, paging, etc., so it makes more sense to build it on top of an existing B-tree solution.
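As a rough sketch of what "building on top of a B-tree" could look like, the example below stores an inverted index in an ordered key/value store: each posting is a key of the form term + NUL + document id, and a term lookup is a prefix scan over the sorted keys. The SortedKV class is a hypothetical in-memory stand-in for a real store, and the key layout is just one possible assumption, not a description of how Baloo or any particular library does it.

```python
import bisect
import re


class SortedKV:
    """In-memory stand-in for a B-tree-backed key/value store with ordered keys."""

    def __init__(self):
        self._keys = []    # sorted list of byte-string keys
        self._values = {}  # key -> value

    def put(self, key, value=b""):
        """Insert or overwrite a key, keeping the key list sorted."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


def posting_key(term, doc_id):
    """Encode one posting as term + NUL + fixed-width big-endian document id."""
    return term.encode("utf-8") + b"\x00" + doc_id.to_bytes(8, "big")


def index_document(store, doc_id, text):
    """Add one posting per distinct lowercase word in the text."""
    for term in set(re.findall(r"\w+", text.lower())):
        store.put(posting_key(term, doc_id))


def lookup(store, term):
    """Return the document ids containing the term, in ascending order."""
    prefix = term.encode("utf-8") + b"\x00"
    return [
        int.from_bytes(key[len(prefix):], "big")
        for key, _ in store.scan_prefix(prefix)
    ]


def search_all(store, terms):
    """AND query: ids of documents that contain every given term."""
    postings = [set(lookup(store, term)) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

Because the document id is encoded at a fixed width in big-endian order, the postings for a term come back already sorted, which is what makes merging posting lists cheap in a real store. The file access, concurrency and paging concerns would be left to the underlying library.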

Berkeley DB

License: GPL
Language: C