Baloo: Difference between revisions

From KDE Community Wiki
No edit summary
(→‎Indexing limitations: fix bug #, mention iconv)
 
(49 intermediate revisions by 13 users not shown)
Baloo is the next generation of the Nepomuk project. It's responsible for handling user metadata such as tags, ratings, and comments. It also handles indexing and searching for files, emails, contacts, etc. Baloo aims to be lighter on resources and more reliable than its parent project.
[[File:Mascot konqi-support-search.png|thumbnail|right|Help [[Konqi]] find what he wants!]]
Baloo is the file indexing and file search framework for KDE Plasma, with a focus on providing a very small memory footprint along with extremely fast searching.


== What's wrong with Nepomuk? ==
== User documentation ==


The Nepomuk project started as a research project in the European Union. It was built completely on top of RDF. While RDF is great from a theoretical point of view, it is not the simplest tool to understand or optimize. The databases which currently exist for RDF are not suited for desktop use.
[https://github.com/KDE/baloo/blob/master/docs/user/searching.md User documentation] for search types and document properties that Baloo indexes


The Nepomuk developers have tried very hard over the last years to optimize the indexing and searching infrastructure, and they have now come to the conclusion that Nepomuk cannot be further optimized without migrating away from RDF.
== Ways to communicate ==
:Mailing List: [email protected] ([https://mail.kde.org/mailman/listinfo/kde-devel info page])
:IRC Channel: [https://web.libera.chat/ #kde-devel on Libera Chat]
:Phabricator project: https://phabricator.kde.org/project/view/261


RDF also heavily relied on ontologies. These ontologies are a way to describe how the data should be stored and represented. They used the ontologies from the original EU research project - Shared Desktop Ontologies. These ontologies were not designed from a performance or ease of use point of view. They are quite vague in certain areas and often duplicate information. This leads to scenarios where it takes forever to figure out how the data should be stored. Additionally, since all the data needs to be stored in RDF, one cannot optimize for one specific data type.
== Top bugs and feature requests ==
'''Bugs:''' https://bugs.kde.org/buglist.cgi?bug_severity=critical&bug_severity=grave&bug_severity=major&bug_severity=crash&bug_severity=normal&bug_severity=minor&bug_status=UNCONFIRMED&bug_status=CONFIRMED&bug_status=ASSIGNED&bug_status=REOPENED&list_id=1629910&priority=VHI&priority=HI&product=frameworks-baloo&query_format=advanced
<br/><br/>
'''Feature requests:''' https://bugs.kde.org/buglist.cgi?bug_severity=wishlist&bug_status=UNCONFIRMED&bug_status=CONFIRMED&bug_status=ASSIGNED&bug_status=REOPENED&list_id=1629911&priority=VHI&priority=HI&product=frameworks-baloo&query_format=advanced


Given these shortcomings the Nepomuk developers decided to drop RDF and rechristen the project under the name of Baloo.
== Indexing limitations ==
Baloo uses the file metadata extractors in [https://invent.kde.org/frameworks/kfilemetadata KFileMetadata] to get information about each file it indexes.
This means for a file's content to be indexed
* the file must have a recognizable MIME type
* KDE must have an extractor for that MIME type. Use the command-line utility <code>kmimetypefinder5</code> to determine a file's MIME type.
** Due to a [https://gitlab.gnome.org/GNOME/glib/-/issues/2511#note_1293471 glib bug], the MIME type of HTML files can change from  <code>text/html</code> to <code>application/x-extension-html</code>. The KDE file metadata extractors don't recognize the latter. That bug has a workaround to reset the MIME types to the usual values.
* KFileMetadata uses the aging utilities <code>catdoc</code>, <code>xls2csv</code>, <code>catppt</code> to index content of files using the Microsoft Office Word, Excel, and PowerPoint file formats ([https://invent.kde.org/frameworks/kfilemetadata/-/blob/master/src/extractors/officeextractor.cpp#L20 source]), and these utilities have undocumented limitations ([https://bugs.kde.org/show_bug.cgi?id=438455 bug 438455]).
** [http://www.wagner.pp.ru/~vitus/software/catdoc/ catdoc home page]
** [https://bugs.debian.org/cgi-bin/pkgreport.cgi?repeatmerged=no&src=catdoc Debian's bug list for catdoc]; [https://bugzilla.redhat.com/buglist.cgi?quicksearch=catdoc RedHat's bug list for catdoc]
* KFileMetadata does not index file names or file contents in ZIP archives.
* KFileMetadata does not index the contents of Open Document Format files that are ZIP archives, nor does it index "flat" Open Document Format files that are complex XML files.


= Migrating from Nepomuk to Baloo =
Other limitations:
* Baloo doesn't index text files (those whose MIME type is detected as "text/''something''") over 10 MB ([https://invent.kde.org/frameworks/baloo/-/blob/master/src/file/extractor/app.cpp#L143 source]).
* The KFileMetadata extractor for text attempts to convert text to Unicode. If the file uses another encoding, such as iso-8859-1, any file contents after the first character that is invalid in Unicode will not be indexed ([https://bugs.kde.org/show_bug.cgi?id=440537 bug 440537]). You may find the <code>-i</code> option to the <code>file</code> command-line utility useful; it tries to infer the character set of a file, e.g. <kbd>file -i ''path/to/myfile.txt''</kbd>. You can use the <code>iconv</code> command-line utility to report invalid encodings and convert encodings to UTF-8.
* If a file's modification time is January 1 1970 ("zero" in the Unix epoch) or earlier, baloo will reindex it each time it starts (or you run <kbd>balooctl check</kbd>) ([https://bugs.kde.org/show_bug.cgi?id=456108 bug 456108]), and <code>balooshow</code> will be confused about the file's "Mtime" if it is before January 1 1970. As a workaround you can change the modification time to something after 1970, e.g. <kbd>touch -m --date=2022-01-01 path/to/myfile</kbd>.
* [https://discuss.kde.org/t/how-do-i-troubleshoot-baloo/2830/12 Some users] report that baloo doesn't properly index some files extracted from zip or JAR files. A workaround is to clear them from baloo's index and then reindex them with <kbd>balooctl clear ''/path/to/file''</kbd> followed by <kbd>balooctl index ''/path/to/file''</kbd>.


Nepomuk was used to store the tags, ratings, and user comments in Files. This data can be migrated by running the <tt>nepomukbaloomigrator</tt>. Nepomuk was also used to store indexed information about Files, Emails and Contacts. Baloo shall reindex this information directly from the source.
== Other Baloo pages here ==
Information may be obsolete.
{{Special:PrefixIndex/{{FULLPAGENAME}}/}}


= Running Nepomuk and Baloo together =
== Using Baloo ==


Nepomuk and Baloo can both coexist perfectly. However, it may not be the best idea to run both of them on the same system as they both would then be indexing your files, emails and other data.
Baloo is not an application, but a daemon to index files.  Applications can use the Baloo framework to provide file search results. For example, [[Dolphin]]'s Content search can use Baloo.


Tags, ratings and comments will not be synchronized between Nepomuk and Baloo after the initial migration.
KDE System Settings > File Search provides an [http://vhanda.in/blog/2014/04/desktop-search-configuration/ intentionally limited number of settings].  You can make additional adjustments in [[Baloo/Configuration | Baloo's configuration file]].


Applications relying on Nepomuk will have to migrate to Baloo. Their progress can be tracked over here - http://community.kde.org/Baloo/NepomukPort
== balooctl ==


= Baloo, Nepomuk, KDE4 and KF5 =
<code>balooctl</code> is a CLI command to perform certain operations on Baloo. Enter <code>balooctl --help</code> in a terminal app such as [[userbase:Konsole]] to list its available subcommands.


The Nepomuk project will not be ported to Qt5 and KF5. The Baloo project will be ported to KF5. This ported version of Baloo will continue to use the same database as the KDE4 version and will be completely compatible.
See also [[Baloo/Debugging]].

Latest revision as of 04:03, 9 March 2024

Help Konqi find what he wants!

Baloo is the file indexing and file search framework for KDE Plasma, with a focus on providing a very small memory footprint along with extremely fast searching.

User documentation

User documentation for search types and document properties that Baloo indexes

Ways to communicate

Mailing List: [email protected] (info page)
IRC Channel: #kde-devel on Libera Chat
Phabricator project: https://phabricator.kde.org/project/view/261

Top bugs and feature requests

Bugs: https://bugs.kde.org/buglist.cgi?bug_severity=critical&bug_severity=grave&bug_severity=major&bug_severity=crash&bug_severity=normal&bug_severity=minor&bug_status=UNCONFIRMED&bug_status=CONFIRMED&bug_status=ASSIGNED&bug_status=REOPENED&list_id=1629910&priority=VHI&priority=HI&product=frameworks-baloo&query_format=advanced

Feature requests: https://bugs.kde.org/buglist.cgi?bug_severity=wishlist&bug_status=UNCONFIRMED&bug_status=CONFIRMED&bug_status=ASSIGNED&bug_status=REOPENED&list_id=1629911&priority=VHI&priority=HI&product=frameworks-baloo&query_format=advanced

Indexing limitations

Baloo uses the file metadata extractors in KFileMetadata to get information about each file it indexes. This means for a file's content to be indexed

  • the file must have a recognizable MIME type
  • KDE must have an extractor for that MIME type. Use the command-line utility kmimetypefinder5 to determine a file's MIME type.
    • Due to a glib bug, the MIME type of HTML files can change from text/html to application/x-extension-html. The KDE file metadata extractors don't recognize the latter. That bug has a workaround to reset the MIME types to the usual values.
  • KFileMetadata uses the aging utilities catdoc, xls2csv, catppt to index content of files using the Microsoft Office Word, Excel, and PowerPoint file formats (source), and these utilities have undocumented limitations (bug 438455).
  • KFileMetadata does not index file names or file contents in ZIP archives.
  • KFileMetadata does not index the contents of Open Document Format files that are ZIP archives, nor does it index "flat" Open Document Format files that are complex XML files.

Other limitations:

  • Baloo doesn't index text files (those whose MIME type is detected as "text/something") over 10 MB (source).
  • The KFileMetadata extractor for text attempts to convert text to Unicode. If the file uses another encoding, such as iso-8859-1, any file contents after the first character that is invalid in Unicode will not be indexed (bug 440537). You may find the -i option to the file command-line utility useful; it tries to infer the character set of a file, e.g. file -i path/to/myfile.txt. You can use the iconv command-line utility to report invalid encodings and convert encodings to UTF-8.
  • If a file's modification time is January 1 1970 ("zero" in the Unix epoch) or earlier, baloo will reindex it each time it starts (or you run balooctl check) (bug 456108), and balooshow will be confused about the file's "Mtime" if it is before January 1 1970. As a workaround you can change the modification time to something after 1970, e.g. touch -m --date=2022-01-01 path/to/myfile.
  • Some users report that baloo doesn't properly index some files extracted from zip or JAR files. A workaround is to clear them from baloo's index and then reindex them with balooctl clear /path/to/file followed by balooctl index /path/to/file.
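
The encoding and modification-time limitations listed above can be checked for a given file before relying on the index. The following is a minimal Python sketch, not part of Baloo or KFileMetadata. The function names are hypothetical, and the encoding check assumes that "invalid in Unicode" means "not valid UTF-8", which the bug report suggests but this page does not state outright.

```python
import os
import tempfile


def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Assumption: content after this offset is what the text extractor
    would fail to index. The whole file is read into memory.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


def has_epoch_or_earlier_mtime(path):
    """Return True if the modification time is at or before 1970-01-01 UTC."""
    return os.stat(path).st_mtime <= 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        sample = os.path.join(directory, "latin1.txt")
        # "café au lait" encoded as iso-8859-1: byte 0xE9 is not valid UTF-8 here.
        with open(sample, "wb") as handle:
            handle.write(b"caf\xe9 au lait")
        print(first_invalid_utf8_offset(sample))

        # Set the modification time to the Unix epoch, then test for it.
        os.utime(sample, (0, 0))
        print(has_epoch_or_earlier_mtime(sample))
```

A file for which the first function returns an offset is a candidate for conversion with iconv, and a file for which the second returns True is a candidate for the touch workaround.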

Other Baloo pages here

Information may be obsolete.

Using Baloo

Baloo is not an application, but a daemon to index files. Applications can use the Baloo framework to provide file search results. For example, Dolphin's Content search can use Baloo.

KDE System Settings > File Search provides an intentionally limited number of settings. You can make additional adjustments in Baloo's configuration file.

balooctl

balooctl is a CLI command to perform certain operations on Baloo. Enter balooctl --help in a terminal app such as userbase:Konsole to list its available subcommands.
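
As an illustration of driving balooctl from a script, the sketch below wraps the clear-then-index workaround mentioned under the indexing limitations. It is a hedged example rather than an official interface: the helper names are hypothetical, only the clear and index subcommands named on this page are used, and the commands run only if balooctl is found on the PATH.

```python
import shutil
import subprocess


def reindex_commands(path):
    """Build the two balooctl invocations for the clear-then-index workaround."""
    return [["balooctl", "clear", path], ["balooctl", "index", path]]


def reindex(path):
    """Run the workaround if balooctl is installed; return True if it ran."""
    if shutil.which("balooctl") is None:
        # balooctl is not available on this system, so do nothing.
        return False
    for command in reindex_commands(path):
        # check=True raises CalledProcessError if balooctl exits with an error.
        subprocess.run(command, check=True)
    return True
```

Keeping the command construction separate from execution makes it possible to inspect what would be run before touching the index.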

See also Baloo/Debugging.