KDE PIM/Akonadi Next/Design

A Proposed System Design

Each synchronizer will know how to work with a specific source. For each configured instance of that kind of source, a store will be created which will be managed by an instance of the synchronizer. Synchronizers will run in their own process and be responsible for writing changes to the store, synchronizing these changes with the source and synchronizing the store with changes made in the source. Synchronizers will be plugins implementing a small but well-defined API which allows for command input and notification of changes in the store.

A shell application will load the configuration for a resource, dynamically load the correct synchronizer and instantiate the correct pipelines. This shell will also open a local socket that follows a convention allowing clients to address a given resource's shell. The shell will be responsible for communication between clients and the resource, queueing changes for later synchronization, and running the pipelines for each class of command.

Stores will

be simple key/value stores
support multi-reader, single-writer
values representing messages will be binary buffers that are directly readable using flatbuffers (provisional choice for buffer implementation)
values of secondary indexes will be keys that map to the message entry
support tree objects making folder trees or threaded message list structures readable in single, non-recursive requests
support lazy loading: entries can be marked as needing to be loaded (e.g. an email folder name but no messages)
provide an incrementing revision ID on every message record and tree object; the current maximum revision ID is the store's current revision
keep track of the unsynchronized revision IDs (a set of revision endpoints, e.g. 100-150, 160-170 would imply revisions less than 100 and between 151 and 159 are synchronized; the empty set means everything is synchronized); this is the store's sync state
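The sync state described in the last item could be modelled roughly as follows. This is only a sketch: the class name `SyncState`, its methods, and the choice of inclusive `(start, end)` tuples are assumptions made for illustration, not part of the design.

```python
class SyncState:
    """Unsynchronized revisions kept as inclusive (start, end) ranges (hypothetical model)."""

    def __init__(self, ranges=()):
        self.ranges = sorted(ranges)

    def is_synced(self, revision):
        # A revision is synchronized when it falls in none of the ranges.
        return not any(start <= revision <= end for start, end in self.ranges)

    def mark_synced(self, revision):
        # Remove one revision, splitting the range that contains it if needed.
        updated = []
        for start, end in self.ranges:
            if not (start <= revision <= end):
                updated.append((start, end))
                continue
            if start < revision:
                updated.append((start, revision - 1))
            if revision < end:
                updated.append((revision + 1, end))
        self.ranges = updated

    def fully_synced(self):
        # The empty set means everything is synchronized.
        return not self.ranges


state = SyncState([(100, 150), (160, 170)])
```

With the example ranges from the text, `state.is_synced(155)` is true and `state.is_synced(120)` is false.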

Clients will read directly from the stores using a client library that implements:

resource enumeration (e.g. from local configuration data)
query of stores
discovery of shell sockets
command delivery to shells
a record of the last revision ID it has seen for each store

When reading from a store, the client library will:

locate the store and open it for reading
check if the shell socket is accepting connections
- if so, connect to it for change notification
- if not, watch for the local socket to become available
if the local socket becomes available (possibly after having been closed), connect automatically
send all modification commands to the shell
- commands are queued and sent one at a time to the shell
  - the shell may request more commands before it completes the last one to drain client-side queues
  - the client-side queue should be considered transient: if the client exits before the queue is drained, the queue is simply discarded
- if the shell socket is not available, it will start the shell (this must allow for multiple clients attempting to do this simultaneously)
release all resources when the client is finished with a resource
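The client-side command queue from the steps above might look something like this. It is a minimal sketch: `ClientCommandQueue`, the `send` callable and the credit counter are hypothetical names, and a real client library would deliver commands over the shell's local socket rather than to a Python list.

```python
from collections import deque


class ClientCommandQueue:
    """Transient queue: sends one command at a time unless the shell asks for more."""

    def __init__(self, send):
        self._send = send        # hypothetical callable delivering one command to the shell
        self._pending = deque()  # discarded if the client exits before it drains
        self._credits = 1        # number of commands the shell will accept right now

    def enqueue(self, command):
        self._pending.append(command)
        self._flush()

    def on_shell_ready(self, count=1):
        # Called when the shell completes a command, or requests more
        # commands early in order to drain the client-side queue.
        self._credits += count
        self._flush()

    def pending(self):
        return len(self._pending)

    def _flush(self):
        while self._credits > 0 and self._pending:
            self._credits -= 1
            self._send(self._pending.popleft())


sent = []
queue = ClientCommandQueue(sent.append)
for name in ("modify-1", "modify-2", "delete-3"):
    queue.enqueue(name)
sent_before_request = list(sent)  # only the first command has gone out
queue.on_shell_ready(2)           # shell asks for two more
```

Nothing in the queue is persisted, which matches the requirement that it is simply discarded if the client exits early.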

Synchronizers and filters may update the store via a helper API in the shell application. This helper API will ensure that the revision ID on message records is updated accordingly.

When a client comes across a to-be-loaded data set in the store it will:

ensure the shell is started
once a connection has been established, request that the resource load the data set in question

This allows each synchronizer to define what gets lazy loaded and how, without making lazy loading a requirement.

When a message is requested:

the message buffer will be loaded from the store and handed to a flatbuffer object, giving direct access to the data
each buffer will contain an identifying tag noting which synchronizer is responsible for it
the message read interface of the synchronizer will be loaded into the client at this point (if not already done)
the flatbuffer will be passed through this interface for adaptation to an appropriate application-visible API, such as a C++ calendar item class
- this also means that locally stored data (e.g. in a maildir) does not require the whole dataset to be replicated in the store: the synchronizer can load the relevant mail from disk at this point, with the message buffer in the store merely providing the correct local path to the mail

This also provides an inexpensive way to get at only the data necessary, such as for mail header lists, as flatbuffers can read from a memory-mapped buffer without parsing the entire contents.
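The idea of reading one field from a memory-mapped buffer without parsing the rest can be illustrated with the standard library alone. Note that this sketch does not use flatbuffers: the fixed layout (two length fields followed by subject and body) and the function names are assumptions standing in for a real flatbuffers schema.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: two little-endian uint32 lengths, then subject bytes, then body bytes.
HEADER = struct.Struct("<II")


def write_message(path, subject, body):
    subject_bytes = subject.encode("utf-8")
    body_bytes = body.encode("utf-8")
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(subject_bytes), len(body_bytes)))
        f.write(subject_bytes)
        f.write(body_bytes)


def read_subject(path):
    # Map the file and touch only the header and subject bytes;
    # the (possibly large) body is never read or decoded.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            subject_len, _body_len = HEADER.unpack_from(view, 0)
            start = HEADER.size
            return view[start:start + subject_len].decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "message.buf")
    write_message(path, "Meeting notes", "x" * 1_000_000)
    subject = read_subject(path)
```

A mail header list would use the same access pattern: map each buffer, pull out the few fields it displays, and leave the bodies untouched.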

Commands include the following classes:

Synchronize: this triggers a synchronization of the store with the source; this may result in the store's revision increasing
Modify: this replaces the existing data in the store with the new data; this causes the store's revision to increase
Delete: removes a message entry from the local store and adds a removal entry; this causes the store's revision to increase
Create: this adds a new message to the store; this causes the store's revision to increase
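How the create, modify and delete commands could affect the store's revision is sketched below. The in-memory `Store` class, its dictionaries and its method names are hypothetical; the actual store is a key/value database written only by the shell.

```python
class Store:
    """In-memory stand-in for a store; every mutation bumps the revision."""

    def __init__(self):
        self.revision = 0
        self.messages = {}  # key -> (revision, payload)
        self.removals = {}  # key -> revision of the removal entry

    def _next_revision(self):
        self.revision += 1
        return self.revision

    def create(self, key, payload):
        if key in self.messages:
            raise KeyError(f"{key} already exists")
        self.messages[key] = (self._next_revision(), payload)

    def modify(self, key, payload):
        if key not in self.messages:
            raise KeyError(f"{key} does not exist")
        # Replaces the existing data with the new data.
        self.messages[key] = (self._next_revision(), payload)

    def delete(self, key):
        # Removes the message entry and adds a removal entry.
        del self.messages[key]
        self.removals[key] = self._next_revision()


store = Store()
store.create("m1", b"hello")
store.modify("m1", b"hello again")
store.delete("m1")
```

After these three commands the store's revision is 3 and only the removal entry for `m1` remains.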

On synchronization:

The synchronizer replicates changes in the source to the store
The synchronizer replicates changes in the store to the source
- This may involve conflict resolution
- This may spawn sets of change commands (modify, delete, create) for the store (which puts source and client mutations through the same pipelines)
- The synchronized ID is updated in the store sync state to reflect progress and prevent duplicate replays

Modify/delete/create commands result in:

The change being queued in the store by the synchronizer
The relevant pipeline being run which may create further changes to the message
The resulting message is recorded in the store
If not the result of synchronization, the synchronizer may then elect to reflect the changes immediately in the source (reflected in the store's sync state)

Pipelines include:

Message mutation (e.g. local mail filter, spam/scam detection)
Content indexing (e.g. full text indexing)
Update indexes (add to index on new, change indexes on modify, remove from indexes on delete)
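A pipeline can be thought of as an ordered list of steps applied to a message for a given class of command. The following is a rough sketch only: the step functions, the `PIPELINES` table and the keyword-based spam check are invented for illustration and say nothing about how real filters would work.

```python
def flag_spam(message, index):
    # Message mutation step (toy rule; a real filter would be far more involved).
    if "lottery" in message["subject"].lower():
        message = {**message, "spam": True}
    return message


def update_subject_index(message, index):
    # Index update step: map each subject word to the message keys containing it.
    for word in message["subject"].lower().split():
        index.setdefault(word, set()).add(message["id"])
    return message


# Hypothetical mapping from command class to its ordered pipeline steps.
PIPELINES = {
    "create": [flag_spam, update_subject_index],
    "modify": [update_subject_index],
}


def run_pipeline(command, message, index):
    # Each step may return a changed message, which is fed to the next step.
    for step in PIPELINES.get(command, []):
        message = step(message, index)
    return message


index = {}
result = run_pipeline("create", {"id": "m1", "subject": "Lottery winner"}, index)
```

Because source-originated and client-originated changes both arrive as commands, they would pass through the same steps.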

On store revision increase:

The shell notifies every client connected to the local socket of the new revision
Clients request, at their leisure, all changes since their last revision
- This is the point when a client will show the user the modifications they requested (e.g. marking a mail as read) ... which implies these roundtrips have to be damn fast, but which also implies that the client always reflects the current real state of the store!
Once all clients have received removal entries (satisfied also by all clients disconnecting) and the removal entry has been synchronized, it is removed from the store
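The rule for discarding removal entries can be written as a small predicate. This is a sketch under assumptions: the function name, the dictionary shapes and the `synced` callback are hypothetical, and it treats "a client has received a removal entry" as "the client's last seen revision is at least the removal's revision".

```python
def purgeable_removals(removals, client_revisions, synced):
    """Return keys of removal entries that may be dropped from the store.

    removals: key -> revision of the removal entry
    client_revisions: connected client -> last revision it has seen
    synced: callable(revision) -> bool, true once replicated to the source
    """
    return [
        key
        for key, revision in sorted(removals.items())
        if synced(revision)
        # all() over no clients is True, so the condition is also
        # satisfied when every client has disconnected.
        and all(seen >= revision for seen in client_revisions.values())
    ]


removals = {"m1": 5, "m2": 9}
clients = {"mail-client": 7, "indexer": 10}
ready = purgeable_removals(removals, clients, synced=lambda rev: rev <= 8)
```

Here only `m1` qualifies: `m2` has not yet been synchronized, and one client has not yet seen revision 9.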

Shell lifecycle management:

If a shell crashes, the client library is responsible for restarting it on next command request (or repeating the last command?)
Each client that connects and sends a command is flagged as such in the shell's connection bookkeeping
When the last client that sent a command disconnects, an automatic shutdown algorithm begins:
- all pipelines are drained (if in progress) and no more actions are allowed to be started
- if a command from one of the remaining clients (or a newly connected client) arrives, the algorithm is aborted and the shell goes back to normal operation
- new client connections are refused
- any write handle to the store is closed
- all remaining clients (which by definition have not sent a command, so are only listening) are disconnected
- the shell process exit()s
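The shutdown algorithm above reads naturally as a small state machine. The sketch below is one possible interpretation: the state names (`running`, `draining`, `exited`), the class and its methods are all assumptions, and closing the write handle and exiting the process are reduced to a state change.

```python
class ShellLifecycle:
    """Toy model of the shell's connection bookkeeping and automatic shutdown."""

    def __init__(self):
        self.state = "running"
        self.command_clients = set()  # clients that have sent at least one command
        self.listeners = set()        # clients that only listen for notifications

    def connect(self, client):
        if self.state == "exited":
            return False
        self.listeners.add(client)
        return True

    def command(self, client):
        if self.state == "exited":
            return False
        self.listeners.discard(client)
        self.command_clients.add(client)
        self.state = "running"  # a command arriving while draining aborts the shutdown
        return True

    def disconnect(self, client):
        self.listeners.discard(client)
        was_commander = client in self.command_clients
        self.command_clients.discard(client)
        if was_commander and not self.command_clients and self.state == "running":
            self.state = "draining"  # last commanding client left: start shutdown

    def pipelines_drained(self):
        if self.state != "draining":
            return
        # From here on new connections are refused, the write handle would be
        # closed, remaining listeners are disconnected and the process exits.
        self.listeners.clear()
        self.state = "exited"


shell = ShellLifecycle()
shell.connect("viewer")
shell.connect("editor")
shell.command("editor")
shell.disconnect("editor")
state_after_last_commander = shell.state  # shutdown has started
shell.command("viewer")
state_after_abort = shell.state           # shutdown was aborted
shell.disconnect("viewer")
shell.pipelines_drained()
```

After the final call the shell has exited, and a later `shell.connect(...)` returns `False` until a client restarts the shell.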

Autostart on command queueing on the client side will restart the shell automatically and cause all non-command-issuing clients to reconnect as well. This allows resources to come and go on an as-needed basis, as they are only required for processing commands and notifying clients of revision ID updates.

With the above, all that is required is the shell application, a resource plugin API, a filter plugin API and a client-side library. All reads will happen in the client process (though potentially in a separate thread in that process), all writes will happen in the auto-started shell process. No central server is required. No database process is required.

This means all data retrieval and collation is done on the client side. For instance:

Collating results between different resources will be a client-side task. (Project view...)
Requesting all emails in a given folder will be done via a query processed client-side (note: query!)
Requesting all calendar events from a given set of calendars for a given set of dates will be done through a similar query (yes: just one query)
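A client-side query that collates calendar events from several stores might be sketched as follows. The event dictionaries, the `query_events` function and the sample data are made up for illustration; in the real design each store would be read through the client library rather than held as a Python list.

```python
import heapq
from datetime import date


def query_events(stores, calendars, start, end):
    """One query over many stores: filter each store, then merge by date."""
    per_store = [
        sorted(
            (e for e in store if e["calendar"] in calendars and start <= e["date"] <= end),
            key=lambda e: e["date"],
        )
        for store in stores
    ]
    # heapq.merge collates the already sorted per-store results.
    return list(heapq.merge(*per_store, key=lambda e: e["date"]))


work_store = [
    {"calendar": "work", "date": date(2015, 1, 12), "title": "Review"},
    {"calendar": "work", "date": date(2015, 3, 1), "title": "Offsite"},
]
home_store = [
    {"calendar": "home", "date": date(2015, 1, 5), "title": "Dentist"},
    {"calendar": "misc", "date": date(2015, 1, 6), "title": "Ignored"},
]

events = query_events(
    [work_store, home_store],
    calendars={"work", "home"},
    start=date(2015, 1, 1),
    end=date(2015, 1, 31),
)
titles = [e["title"] for e in events]
```

The caller states what it wants (calendars and a date range) and the collation across resources happens entirely in the client process.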

In turn, this opens the door for a low-latency, low memory-usage, high-performance, declarative approach to data retrieval.