Projects/Nepomuk/4.11

This page documents the rough roadmap for the KDE Workspaces 4.11 release.

=Graph Mergers=
Nepomuk currently creates a new graph during each transaction. This results in a large number of graphs. We can reduce that number by limiting the graphs to -

* 1 graph per application
* 1 discardable graph per application

If two applications choose to store the same data, then that data will exist in multiple graphs at the same time.

This should make removeDataByApp much faster, and it should simplify the code of the data management model overall.
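As a rough illustration of the idea (not the actual Nepomuk storage code), the per-application model can be sketched with a toy in-memory store. The class and method names below are made up for this example. It only shows why removing all data of one application becomes a matter of dropping at most two graphs, and that data stored by two applications survives until both are removed.

```python
from collections import defaultdict


class GraphStore:
    """Toy triple store: one normal and one discardable graph per application."""

    def __init__(self):
        # (application, discardable) -> set of (subject, predicate, object) triples
        self.graphs = defaultdict(set)

    def add(self, app, triple, discardable=False):
        self.graphs[(app, discardable)].add(triple)

    def remove_data_by_app(self, app):
        # Removing an application's data means dropping at most two graphs.
        self.graphs.pop((app, False), None)
        self.graphs.pop((app, True), None)

    def contains(self, triple):
        return any(triple in graph for graph in self.graphs.values())


store = GraphStore()
title = ("file:///a.mp3", "nie:title", "Song")
store.add("dolphin", title)
store.add("amarok", title)  # same data, stored in a second graph
store.remove_data_by_app("dolphin")
print(store.contains(title))  # the data is still there through amarok's graph
```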

=File Watcher=
The FileWatcher currently has only one backend, inotify, which provides all the functionality that we need. The downside is that we need to install a large number of watches.

It might be better to add some more backends, based on the features they provide, and allow the users to combine them. Possible backends are -

* inotify - Provides all functionality
* fanotify - Provides new file creation / deletion
* kio - Provides file moves, but ONLY for KDE applications

One could allow the users to choose which backend combination they want. A good default might be kio + inotify (without moves). This way the number of inotify watches would also be very low (depending on the number of indexed directories).

Or one could even have fanotify + kio which would only require 1 fanotify watch for file creation and deletion monitoring.
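To make the combinations easier to reason about, here is a small sketch of how a chosen set of backends could be checked against the events we need. The capability table is taken from the list above, with "all functionality" assumed to mean creation, deletion and moves. The function name and the event names are hypothetical.

```python
# Capabilities per backend, as listed above. Treating "all functionality"
# as exactly these three events is an assumption made for this sketch.
BACKENDS = {
    "inotify": {"create", "delete", "move"},
    "fanotify": {"create", "delete"},
    "kio": {"move"},
}

REQUIRED_EVENTS = ("create", "delete", "move")


def assign_events(chosen, required=REQUIRED_EVENTS):
    """Map each required event to the first chosen backend that provides it.

    Returns (assignment, missing), where missing lists uncovered events.
    """
    assignment = {}
    missing = []
    for event in required:
        for name in chosen:
            if event in BACKENDS[name]:
                assignment[event] = name
                break
        else:
            missing.append(event)
    return assignment, missing


# kio first, so that inotify is used "without moves"
print(assign_events(["kio", "inotify"]))
print(assign_events(["fanotify", "kio"]))
print(assign_events(["kio"]))
```

Listing kio before inotify expresses the "inotify without moves" default: moves go to kio and inotify only has to cover creation and deletion.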

=Cleaner GUI=

The Nepomuk Cleaner currently has a horrible GUI. Considering that it is an important tool, it should be improved by doing the following -

Dividing the jobs based on -
* Data Migration
* Duplicate detection
* Invalid data removal
* Data removal

The Data Migration task should only need to be run once. The others can be run whenever the user wants. Ideally we would have some kind of QWizard based UI where the user could choose exactly which jobs to run, and maybe even review the data before acting on it.

=Nepomuk Tools=

Nepomuk should ship with some simple tools to control Nepomuk. The most important ones are -
* NepomukCtl - A simple tool to start, stop and restart Nepomuk and any of its services.
* NepomukShow - A simple tool to dump the Nepomuk data onto the terminal. Giving users queries to run by hand is hard, so it's better to have a specialized tool for this.

NepomukShow already exists in vhanda's scratch repo. It should be polished and then shipped with nepomuk-core.

=Nepomuk Service Management=

Currently each Nepomuk Service is run under a <tt>nepomukservicestub</tt> executable. This makes it hard for the users to provide accurate debugging information because the process name they see is <tt>nepomukservicestub</tt>. It also makes it harder to do per-service optimizations.

It might be better to move towards each service having its own process. This would have the following benefits -

* Each Nepomuk service would have its own easily recognizable process.
* The service could be a KApplication or QCoreApplication depending on its needs.

Currently, all services are KApplications because the file indexer needs to create an instance of KIdleTime, which requires widgets. This results in a lot of extra memory being spent on loading QWidget stuff which is just not required.

With this change we could also rename nepomukserver to nepomuk_control or something similar. It makes little sense to call something the server when it doesn't do the job of a server.

=Nepomuk Shell=

The Nepomuk shell currently does not provide much value. It should ideally help with debugging Nepomuk. That means providing a list of running queries, and logs to show which data management functions have been called.

Preferably something like this -

* StoreResources - With this graph
* AddProperty - resUri - property - value
* SetProperty - resUri - property - value
* RemoveDataByApplication

One can then select an individual operation to see exactly what data was passed in (for example into StoreResources) and what the result of the operation was.
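A minimal sketch of such an operation log could look like the following. It is not based on the existing shell code. The decorator, the log structure and the stand-in <tt>set_property</tt> function are all hypothetical and only show the kind of record (operation, arguments, result) the shell would need for each call.

```python
import functools

CALL_LOG = []  # one entry per data management call, oldest first


def logged(func):
    """Record the name, arguments and result (or error) of every call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = {"operation": func.__name__, "args": args, "kwargs": kwargs}
        try:
            entry["result"] = func(*args, **kwargs)
            return entry["result"]
        except Exception as exc:
            entry["error"] = repr(exc)
            raise
        finally:
            CALL_LOG.append(entry)

    return wrapper


@logged
def set_property(res_uri, prop, value):
    # Stand-in for the real data management call.
    return (res_uri, prop, value)


set_property("nepomuk:/res/1", "nao:prefLabel", "Holiday")
for entry in CALL_LOG:
    print(entry["operation"], entry["args"])
```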

=Pure Socket Communication=

Currently we use a mix of dbus and a local socket for clients to communicate with the Nepomuk Storage. This is not good, especially since dbus was not designed for data transfer. We use it to transfer query results (QueryServiceClient) and indexing data (StoreResources).

There is already a rough proof of concept of this in nepomuk and soprano in the branch customCommandsOverSockets. It works, but it's messy because it involves both blocking and non-blocking communication over the same socket.

Maybe we should move to a completely asynchronous socket communication? How would we go about that? It would require moving away from the Soprano Model concept. Can we really do that?

Maybe we could create our own client-server communication and not use Soprano? How would that help?

=Resource Identifier Unit tests=

The ResourceIdentifier really needs some unit tests to make sure resources are getting identified properly, especially for things like emails, series, seasons, etc. We have all noticed that it acts a little strange at times.

=File Indexer=
The File Indexer needs the following changes -
==Better Plugin System==
===Plugin based versioning===
Each plugin should be able to set its version number and indicate whether its files need to be reindexed when the plugin is updated. We can very easily do this considering that each plugin operates on a set of mimetypes.

One could store the following info in the nepomukstrigirc -
<pre>
[Plugins]
ffmpegPlugin=0.1
taglibPlugin=0.1
</pre>

When the plugin version number is updated, we remove all the indexed data for the files with those mimetypes and then reindex them.
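The version check itself could be sketched roughly as follows. This is only an illustration: the config is parsed with Python's <tt>configparser</tt> as a stand-in for KConfig, and the <tt>INSTALLED</tt> table (plugin versions and their mimetypes) as well as the function name are assumptions. A plugin that is missing from the stored config is treated like an updated one here.

```python
import configparser

# What was stored in nepomukstrigirc after the last indexing run.
STORED = """
[Plugins]
ffmpegPlugin=0.1
taglibPlugin=0.1
"""

# Hypothetical: version and mimetypes reported by each installed plugin.
INSTALLED = {
    "ffmpegPlugin": ("0.2", ["video/mp4", "video/x-matroska"]),
    "taglibPlugin": ("0.1", ["audio/mpeg", "audio/flac"]),
}


def mimetypes_to_reindex(stored_text, installed):
    """Return the mimetypes whose plugin version differs from the stored one."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep plugin names case-sensitive
    parser.read_string(stored_text)
    stored = dict(parser["Plugins"]) if parser.has_section("Plugins") else {}

    stale = set()
    for name, (version, mimetypes) in installed.items():
        if stored.get(name) != version:
            stale.update(mimetypes)
    return sorted(stale)


print(mimetypes_to_reindex(STORED, INSTALLED))
```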

===Plugin mimetype priority===
Different plugins can support a number of mimetypes, and these can overlap. Each plugin should give a number indicating how confident it is in handling each mimetype.

For example, both taglib and ffmpeg can handle mp4 files, but ffmpeg handles them better. However, some distros do not ship ffmpeg extractors, in which case taglib should still be used.
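A possible selection rule, sketched under the assumption that each plugin reports a confidence between 0 and 100 per mimetype (the numbers and names below are invented for the example): pick the installed plugin with the highest confidence, so a less capable plugin is used automatically when the better one is not shipped.

```python
# Hypothetical confidence values (0-100) per plugin and mimetype.
PLUGINS = {
    "ffmpeg": {"video/mp4": 90, "video/x-matroska": 90},
    "taglib": {"video/mp4": 50, "audio/mpeg": 90},
}


def pick_plugin(mimetype, available):
    """Return the available plugin most confident about mimetype, or None."""
    candidates = [
        (PLUGINS[name][mimetype], name)
        for name in available
        if mimetype in PLUGINS.get(name, {})
    ]
    if not candidates:
        return None
    # Highest confidence wins; ties are broken by plugin name.
    return max(candidates)[1]


print(pick_plugin("video/mp4", ["ffmpeg", "taglib"]))  # both installed
print(pick_plugin("video/mp4", ["taglib"]))            # distro without ffmpeg
```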

==Okular based Indexer==

Okular handles a number of document types. However, it uses QWidgets, which could pop up and ask for a password. We need to rewrite Okular to allow us to use its backends directly.

=Web-Miner=

We need to ship this with 4.11. What still needs to be done?

WebMiner location: https://projects.kde.org/projects/extragear/base/nepomuk-webminer

==="kinda" blocker:===
Execution of the KAction python scripts blocks the UI. The problem at the moment is that the PythonInterpreter must run in the same thread in which the action is executed, otherwise the signal/slot stuff does not work (PyThread_Lock exception). This should be solved (if possible).

===Optional:===
The "tvdb with anime lookup" script and the imdb script will not be part of KDE SC because of their heavy dependencies and problems during execution (they stay in an extragear repo for additional scripts), so we might need some more plugins for other websites.

Available at the moment:
* the movie db (for all movies)
* the tvshow db (for all tv shows) (has some problems with anime and different names)
* Microsoft Academics (general pdf search)
* SpringerLink (general pdf search)
* The nature db (general pdf search)

=Controller UI=

Nepomuk users never know what is going on. We need to show them detailed info of what is going on. Jorg already has some mockups, but it could use more work. We need to make a list of stuff that needs to be done.

All necessary data is exposed via dbus.
What could be added are signals that send some progress information. But we need to define exactly what 0-100% means.

In case we do go with the qml stuff, someone needs to enhance the qml files.
The DataEngine + ServiceEngine is working and will be changed according to the FileWatcher/FileIndexer patches when available.

See current nepomukcontroller-qml prototype in my scratch repo: 
http://quickgit.kde.org/?p=scratch/jehrichs/nepomukcontroller-qml.git

=Query Change Monitoring=

Currently the Query Service reruns each query when relevant data changes. Re-running the entire query is not exactly practical. We could instead append this to the query -

?r nao:lastModified ?m . FILTER( ?m > DateTimeOfLastQueryExecution ) .

That way we only get results that were modified after the last execution of the query. This obviously doesn't handle data removal.

Data changes can be divided into 2 parts -
* Data addition
* Data Removal
** Specific data is removed
** The entire resource is removed

We can easily handle data addition with the nao:lastModified trick. Removal of an entire resource can also be handled easily. How do we handle specific data removal? I don't think this can be generically done.

One way of handling this is to keep track of which resources had some data removed and to re-run the query only against those resources - FILTER( ?r in (<..>, <...>) ) . If the list is too long then we can just re-run the entire query?
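Put together, the two tricks could be sketched as plain query string construction. This is only an illustration and not the Query Service code: the function names, the threshold of 50 resources and the use of an xsd:dateTime literal for the time of the last execution are assumptions.

```python
MAX_RECHECK = 50  # hypothetical limit before falling back to a full re-run


def additions_query(pattern, last_run):
    """Query for resources matching pattern that changed after last_run."""
    time_filter = 'FILTER( ?m > "%s"^^xsd:dateTime ) .' % last_run
    return " ".join(["SELECT DISTINCT ?r WHERE {", pattern,
                     "?r nao:lastModified ?m .", time_filter, "}"])


def recheck_query(pattern, touched, limit=MAX_RECHECK):
    """Query to re-check resources that had some data removed.

    Returns None if nothing was removed, and the full query if the list
    of touched resources is longer than limit.
    """
    if not touched:
        return None
    if len(touched) > limit:
        return " ".join(["SELECT DISTINCT ?r WHERE {", pattern, "}"])
    uris = ", ".join("<%s>" % uri for uri in sorted(touched))
    return " ".join(["SELECT DISTINCT ?r WHERE {", pattern,
                     "FILTER( ?r in (%s) ) ." % uris, "}"])


pattern = "?r a nfo:Audio ."
print(additions_query(pattern, "2013-01-01T00:00:00Z"))
print(recheck_query(pattern, ["nepomuk:/res/b", "nepomuk:/res/a"]))
```

Any resource from the re-check list that no longer appears in the results of the second query would then be dropped from the cached result set.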